Method and system for extracting, analyzing, storing, comparing and reporting on data stored in web and/or other network repositories and apparatus to detect, prevent and obfuscate information removal from information servers

ABSTRACT

A system, method and apparatus providing for the search, identification, retrieval and analysis of data contained in World Wide Web (WWW) and network pages and storage repositories. Mechanisms are provided to facilitate selection of such data as is required by a user, to report in a manner required by the user and to present the results in a plurality of ways. Also disclosed are a system and method to protect against unwanted information retrieval from information servers such as those found on the World Wide Web (WWW). A method is described to analyze accesses to the information server for patterns indicating the type of system accessing the server. A method is described to format information such that it cannot be easily machine analyzed by such apparatus as lexical analysis and textual search methods. A method is described to include information into information server contents such that it would mislead and otherwise confuse non-human systems used to retrieve the data. Other methods describe access signature analysis and how this can be used to detect and optionally prevent or modify information requests.

TECHNICAL FIELD

The present invention relates to a method of retrieving data from data sources and, in particular, to a method of retrieving data from data sources such as structured data sources, semi-structured data sources and unstructured data sources, many of which may be rapidly changing. The present invention further relates to a method of analyzing access to an information server and, optionally, acting upon the analysis.

BACKGROUND

The increased popularity of computer networks generally, and the World Wide Web (WWW) in particular, has increased both the amount and complexity of data available. This poses great difficulty to those wishing to find something of specific interest in a data repository. The now widespread world wide web search engines and extraction tools suffer from a number of drawbacks. For example, many search engines base the ranking of results on payment by the data provider, as opposed to the relevance of the result data to the search query. Thus, one trying to decide where to purchase a dozen roses using current search engines and extraction tools may be presented with a wide variety of potentially misleading information, including possibly one or two suppliers who have paid for their placement in the display of search results.

Search engines and browsers only help users locate and inspect information that the search engine has cataloged, while tracking tools can help users keep up-to-date on changes to pertinent information. The ability of a search engine to index information is compromised somewhat by the rapidly changing nature of the data being indexed. For example, a user of a search engine is often directed by the search engine to pages that are misleading, irrelevant or even no longer in use. Moreover, it is almost impossible to identify and automatically compare the results obtained from the search without resorting to visual techniques (i.e., looking at the pages).

Conventional tools are available to facilitate the comparison of pages in a web repository, detect changes in a web page and facilitate a search for specific text. These tools have various limitations, though. In the “rose” example above, the rose seeker would benefit from a mechanism to identify all the sources of roses and similar flora in a geographical area and display the prices. A change in price represents a change in the content of the page that could be detected by conventional tools. However, tools that detect changes over time are confused by the common practice of including the date or time in a web page, leading the tool to believe that the page had changed. Furthermore, pages created by Active Server Page (ASP) servers typically add data or information to a page that is not normally visible using a web browser. This non-visible information confuses conventional tools that search for information in a WWW page.

There are a number of conventional products that copy portions or the entire contents of WWW sites. Examples of these products, commonly termed “web mirrors”, include Teleport Pro, SiteEater and NetAttache. There are also many web-based services that provide specific information on topics of interest and compare data. Examples of these are the now ubiquitous search engines such as Lycos and Yahoo, price “watching” sites such as www.pricewatch.com and aggregation products such as those from NQL Inc. For example, web mirror products download portions or the entire contents of web sites and optionally perform some analysis of the data being extracted. Some allow the user to search for text in the pages and others detect differences between previous downloaded versions. Problems and limitations include:

-   They can quickly get “lost” in the web. For example, if one is looking for a file called myFile.zip in the archive site www.bhs.com (a large and well known repository for the Windows-NT operating system), a web mirror does not have the intelligence to determine that the zip file is located on an external “hidden” server (no domain name, just an IP address), the link to which can be found only through several levels of indirection. In its attempt to retrieve the myFile.zip file, a web mirror product would typically attempt to download the entire www.bhs.com site, attached advertisement servers, sundry off-site references and ultimately the contents of the entire web. The practice of putting the target file onto a hidden server is common in web repositories and is intended to keep ASP server design simple and confuse robots gathering the data. As just discussed, this can be a very effective technique.
-   The text search can become confused by the now rampant practice of including comments, control statements, Javascript and other material into web pages.
-   The widespread practice of including date and time information into a page means that the page tomorrow is different from the page today, rendering the concept of detecting changes over time virtually useless. Some current tools attempt to eliminate this disadvantage by looking for and ignoring date fields, others by letting the user exclude the field. However, even ignoring time and date fields does not address the problems caused by the very widespread practice of changing advertisements on a web page.
-   Web mirror tools that compare web pages lack the sophistication to convert the data into a contextually similar format such that it can be compared. For example, consider the comparison of the following text contained on two web sites: “Books for sale, from $5.00” and “Romantic Novel Sale! Prices starting at $5.00”. The English meaning of these two phrases is the same, yet the text is different and would fail a text comparison and render any form of automatic analysis extremely difficult. As another example, consider the comparison of the following text contained in two documents: “Burger, fries and large soda, $4.95” and “Burgers, soda and fries for $5.00”. The two contain a small difference, the meaning of which is entirely dependent on the reader. Such variations in textual construction and similar meaning are commonplace in documents such as those found on the web and are to be expected.
-   Furthermore, services on the world wide web that provide comparisons and analysis are becoming increasingly popular, yet all of these are server side. That is, they are systems that provide the service or data to other systems connected to them. For example, a user would use a web browser to access data on a server which provided price comparison information. The user would be a client to the server providing the requested data. Server side service providers typically employ a variety of techniques to compare and provide data. These techniques typically include:
    -   Arranging some sort of paying relationship with the site(s) that are providing data for the service.
    -   Having a series of ad-hoc scripts and programs to gather data.
    -   Actually having people manually type or enter data into the database.

Furthermore, there are products that employ clervers and keyword searches in the fields of data identification, one such product being available from www.opencola.com. Such products attempt to identify the relevancy of data in a data repository by comparing the data with a set of keywords defined by the user. The higher the number of keyword matches found, the more relevant the data is considered to be. Some such systems claim the ability to self-learn additional keywords to be used in future comparisons.

There are many disadvantages of this technique, and particularly to using keywords. A significant disadvantage is the inability to identify the meaning of the keywords and the context in which they are used. For example, the words “lite” and “light” can be used in the context of electrical illumination as in “lite bulb” and “light bulb”. “Lite” is the American English equivalent of the British English word “light” but also has contextual meaning in terms of the mass of the object being referred to. Such dualities of meaning and spelling should be expected, as should be miss-spellings, grammatical errors and plural terms. This document has used the term “miss-spelling”, but could equally have used the term “misspelling” or “miss spelling”. Even using similar keyword combinations can give rise to incorrect matches as in the example textual fragment “. . . contact Miss Spelling who has found your lost dog . . . ” Furthermore, the possibilities of plural terms, hyphenated terms and possessive terms increase the number of keyword permutations, leading to an exponential increase in comparison time, storage requirements and potential for error. Keyword comparisons lack the ability to combine meaning and context and cannot easily or accurately cope with the unknown multiplicity of combinations that are to be expected in documents such as those found on the World Wide Web.

Because keyword comparison systems do not extract the meaning of the data being examined, contextual analysis is difficult, if not impossible, and the accuracy of the data being examined cannot be validated. This lack of meaning and context can require significant human examination of the data and renders collaborative sharing with others more difficult.

Accuracy of a data item can be defined as the measure of difference between the meaning of a data item and that of a reference data item. Determining the accuracy of data from sources of unknown reliability typically involves comparisons against other data considered to be of a similar nature from reliable sources. For example, a data item expressing the mathematical sum 2+2=5 is obviously false due to the overwhelming amount of evidence to the contrary. Determining a measure of accuracy of information found in large, unstructured repositories such as those on the World Wide Web (e.g., news sites) involves cross-referencing different data from many different sources, producing a list of similar data. The larger the amount of corroborating data, the higher the degree of reliability or accuracy. Since the nature of the data is unknown, keyword searches as used in the art do not provide an indication of similarity or dissimilarity and can be considered inaccurate.

On the other hand, a supplier of books on the world wide web (for example) wants potential customers to access their web site, but does not want their competitors to download all the information for analysis.

SUMMARY

The invention relates to a system to recursively identify, selectively extract, compare, store and report on data from defined web and network data sources. The system provides mechanisms for a user to define data sources, to define extraction and rejection criteria, and to define actions to be taken on the extracted data. In accordance with other aspects of the invention, data stores are provided that dynamically adapt to the nature and meaning of the data that is stored in them in a manner that ascribes the meaning and context of the data with respect to other data.

In specific embodiments, users can input criteria which are then used to locate and extract data in data sources on the world wide web and/or other computer networks. Once accessed, the data is analyzed using criteria defined by the user, optionally stored, and presented to the user in a form defined by the user. The extraction, comparison and analysis may be either “client side”, “server side”, or in a combination called a “clerver” but should not be considered restricted to such devices or device combinations.

The use of user-defined data sets and set locations overcomes the problems associated with pages generated by Active Server Pages (ASP) servers or other systems that generate a web page on request. For example, the user may define what data sets are to be checked for a change, rather than simply checking that the page has changed in an indeterminate manner. In addition, methods are provided to track changes in the location of the data. Once a data item has been located, its contextual meaning, location and immediate environment may be recorded such that the data item can later be found even in the event that its location changes.
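As a minimal sketch of checking a user-defined data set for change rather than the whole page, the following compares a fingerprint of only the selected block. The function names, the start/end keyword convention and the sample page layout are illustrative assumptions and are not taken from the specification.

```python
import hashlib
import re


def extract_data_set(page_text, start_keyword, end_keyword):
    """Return the text between two keywords, or None if the block is absent."""
    pattern = re.escape(start_keyword) + r"(.*?)" + re.escape(end_keyword)
    match = re.search(pattern, page_text, flags=re.DOTALL | re.IGNORECASE)
    return match.group(1).strip() if match else None


def data_set_fingerprint(page_text, start_keyword, end_keyword):
    """Hash only the user-defined block, so dates and adverts are ignored."""
    block = extract_data_set(page_text, start_keyword, end_keyword)
    if block is None:
        return None
    return hashlib.sha256(block.encode("utf-8")).hexdigest()


# Hypothetical page snapshots: the date and advert change daily, the price rarely.
monday = "Generated Mon 09:00 | Roses: dozen red $24.99 | Ad: buy tyres"
tuesday = "Generated Tue 17:30 | Roses: dozen red $24.99 | Ad: fly cheap"

unchanged = (data_set_fingerprint(monday, "Roses:", "|")
             == data_set_fingerprint(tuesday, "Roses:", "|"))
```

Here the two snapshots differ as whole pages, but the fingerprint of the selected block is identical, so no change is reported for the data set of interest.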

In some embodiments, the method operates based on a Universal Data Locator Object (UDIO) that defines a route to the data by effectively prohibiting the extraction engine from following links that do not lead to a particular type of data. This may involve a mixture of directing the engine down a link to search for data and only allowing a specific level of depth and only a particular specified number of sites away from the start location. There may also be constraints on the breadth and depth traversal to constrain the amount of data being examined. There may also be constraints on the time taken for such operations.
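The traversal constraints described above can be illustrated with a small breadth-first walk over an in-memory link graph. This is only a sketch under stated assumptions: the function name, the `links` mapping (standing in for links parsed from real pages), and the interpretation of "sites away" as the number of host changes along the path are all hypothetical.

```python
from collections import deque
from urllib.parse import urlparse


def constrained_crawl(start, links, max_depth=2, max_site_hops=1, max_pages=100):
    """Breadth-first walk of `links` (url -> list of urls) under route constraints.

    max_depth     -- how many links may be followed from the start page
    max_site_hops -- how many host changes are allowed along any one path
    max_pages     -- overall cap on the number of pages examined
    """
    start_host = urlparse(start).netloc
    queue = deque([(start, 0, 0, start_host)])  # url, depth, site hops, host
    seen = {start}
    visited = []
    while queue and len(visited) < max_pages:
        url, depth, hops, host = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # depth limit reached: do not follow links from here
        for nxt in links.get(url, []):
            if nxt in seen:
                continue
            nxt_host = urlparse(nxt).netloc
            nxt_hops = hops + (nxt_host != host)
            if nxt_hops > max_site_hops:
                continue  # too many sites away from the start location
            seen.add(nxt)
            queue.append((nxt, depth + 1, nxt_hops, nxt_host))
    return visited
```

A time limit could be added in the same loop by comparing a monotonic clock against a deadline; it is omitted here to keep the sketch deterministic.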

As for the comparison operation, in some embodiments, the data is “normalized” before comparison. The normalization in some embodiments may require access to potentially large quantities of data, the meaning and context of which are obtained from the storage devices. By normalizing the data, the data to be compared is in a contextually compatible format that can then be used to generate reports, compare with other contextually similar data, and other purposes that the specific embodiment may require. In addition, by employing the context, the amount of data that needs to be considered is diminished.

Consumers and businesses can all benefit from a mechanism that facilitates the easy identification and comparison of data. For example, those performing electronic commerce can easily identify and catalog information on web repositories of their competitors.

In accordance with another aspect, the present invention is a method and apparatus that enable data requests to an information server to be analyzed, logged and/or displayed before a plurality of optional modifications are made to the request and/or data supplied by the information server.

Furthermore, a system is provided to identify the nature and origin of access to a networked data repository, allowing actions to be performed appropriate to the identified nature and origin of the access. The identification of hit origins allows information servers to be protected from undesired usage of information accessed whilst allowing otherwise full information availability.

Yet further, methods and apparatus are provided that present the information contained in an information server in a manner that makes the information difficult or impossible to analyze by means other than direct human inspection (e.g., visually). Since the amount of information on information servers and repositories is large, human inspection, analysis and comparison of such data would typically be prohibitively time consuming and such inspection would thus typically be required to be performed by automated methods such as an electronic computer.

In accordance with an aspect of the invention, access records and information maintained on the information server relating to all accesses to the server are analyzed to determine a hit signature for each of the hit origins. The hit signature is analyzed to determine characteristics that indicate the probable nature of the hit origin. Information from the hit signatures is then used to control the access to and information provided by an information server.
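One way to picture a hit signature is as a summary of the timing between successive accesses from one origin. The sketch below is an assumption-laden illustration only: the choice of timing as the signature, the function names, and the threshold values are hypothetical and are not specified in the text above.

```python
from statistics import mean, pstdev


def hit_signature(timestamps):
    """Summarise one origin's accesses: hit count, mean gap and gap spread (seconds)."""
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if not gaps:
        return {"hits": len(ordered), "mean_gap": None, "gap_spread": None}
    return {"hits": len(ordered), "mean_gap": mean(gaps), "gap_spread": pstdev(gaps)}


def non_human_score(signature, fast_gap=1.0, regular_spread=0.2):
    """Crude score in [0, 1]; both thresholds are illustrative assumptions."""
    if signature["mean_gap"] is None:
        return 0.0  # a single hit carries no timing information
    score = 0.0
    if signature["mean_gap"] < fast_gap:
        score += 0.5  # requests arrive faster than a person could read a page
    if signature["gap_spread"] < regular_spread:
        score += 0.5  # requests arrive at machine-like regular intervals
    return score
```

A server could then apply different responses depending on the score, for example serving full content to low-scoring origins and modified content to high-scoring ones.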

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1. A diagram showing the various components of a simple electronic computer which embodiments could use to facilitate operation of the invention.

FIG. 2. A diagram showing an overview of a system used to extract data from example data repositories typical of those to be found on the world wide web.

FIG. 3. A diagram showing how a plurality of data locations appear in an example World Wide Web page.

FIG. 4. A diagram showing data identifiers used to describe data blocks.

FIG. 5. A diagram showing the components of a Universal Parameter Object (UPO).

FIG. 6. A diagram showing the components of a Natural Language Processor.

FIG. 7. A diagram showing different types of Natural Language.

FIG. 8. A diagram showing how Word Identifiers contain word meanings and equivalents and associated actions.

FIG. 9. A diagram showing how Word Identifiers map to Natural Language.

FIG. 10. A diagram showing Headword and Synonym relationships.

FIG. 11. A diagram showing example Word Classifications.

FIG. 12. A diagram showing word meaning comparisons.

FIG. 13. A diagram showing word expectation relationships.

FIG. 14. A diagram showing basic store types and triggers.

FIG. 15. A diagram illustrating interconnections of volatile and persistent adaptive storage cells across a network and residing in the same system.

FIG. 16. A diagram showing a storage application programming interface (API) addressing the incompatibilities and inconsistencies between various storage devices.

FIG. 17. A diagram showing an adaptive store including an Index and an Adaptive List referencing data objects.

FIG. 18. A diagram illustrating an adaptive store without any indexing mechanism.

FIG. 19. A diagram depicting a simple access to an adaptive store.

FIG. 20. A diagram showing priority values in an adaptive store. FIG. 20-1 illustrates an algorithm for accessing the FIG. 20 adaptive store.

FIG. 21. A diagram showing a series of URL's in a chain that eventually locate a data item.

FIG. 22. A diagram showing the components of a Universal Data Identifier (UDI) that combines the parameters required to extract the data from a repository.

FIG. 23. A diagram showing the components of a Universal Resource Locator UDI that combines the parameters required to extract the data from the world wide web.

FIG. 24. A diagram showing the components of the Client Data Extractor which uses UDIO objects from the UDI to extract data from data repositories.

FIG. 25. A diagram showing the components of the world wide web extraction engine which uses UDIO objects from the UDI to extract data from world wide web data repositories.

FIG. 26. A diagram showing the components of the URL Download Manager that determines if a URL's data should be downloaded and performs the download process, which in preferred embodiments would involve the use of threads. Also shown are hit signature criteria.

FIG. 27. A diagram showing the components of the URL List Manager.

FIG. 28. A diagram showing the components of the Data Results Processor (DRP) which formats the data into a context required by the user.

FIG. 29. A diagram showing an example web page, and how data locations are defined.

FIG. 30. A diagram showing timing properties to load a page into a display device such as a world wide web browser; properties of human access to a world wide web page; and properties of non-human access to a world wide web page.

FIG. 31. A diagram showing particular data identifiers to define the data location.

FIG. 32. Page Hierarchy with Image Maps.

FIG. 33. A diagram showing methods employed to analyze hit signatures.

FIG. 34. A diagram showing signature proximity values from which access origin probabilities can be determined.

FIG. 35. A diagram showing a hit index.

FIG. 36. A diagram showing knowledge sharing clervers connected on a network.

FIG. 37. A diagram showing adaptive stores interconnected in a number of independent networks.

FIG. 38. A diagram illustrating an example of how the FIG. 36 interconnected adaptive stores interoperate.

FIG. 39. A diagram illustrating a further example of how the FIG. 36 interconnected adaptive stores interoperate.

DETAILED DESCRIPTION

In accordance with one broad aspect, the present invention provides a method executing on one or more computers to follow data into a repository based on data links and locator information provided by a user and/or from the repository, and extracting the data. The extracted data may be analyzed and optionally compared with other contextually similar data.

As used herein, the term “computer” is meant broadly and not restrictively, to include any device or machine capable of accepting data, applying prescribed processes to the data, and supplying results of the processes. In accordance with some embodiments, the methods described herein are performed by a programmed and programmable computer of the type that is well known in the art, an example of which is shown in FIG. 1.

The computer system 108 shown in FIG. 1 has a display monitor 102, a keyboard 104, a pointing/clicking device 106, a processing unit 112, a communication or network interface 114 (e.g., modem; ethernet adapter), other interfaces 110 consistent with the application of the embodiment and an adaptive persistent storage device 116. The adaptive storage 116 includes a storage area for the storage of computer program code and for the storage of data and could be in the form of magnetic media such as floppy disks or hard disks, optical media such as CD-ROM, or other forms.

The processing unit 112 is connected to the display 102, the keyboard 104, the pointing/clicking device 106, the interfaces 114, 110 and the storage device 116. The interfaces 114, 110 provide a channel for communication with other computers and data sources linked together in a network of systems or other apparatus capable of storing data and providing access to the stored data. In some embodiments, the network is the Internet (including the world wide web) and/or one or more intranets. As used herein, “persistent storage” 116 includes any form of data storage including “adaptive storage” and “neural storage.” Data held in storage 116 may be represented by a plurality of and combination of forms such as compressed, encoded, encrypted and bare (i.e., unchanged).

Referring to FIG. 2, client nodes 200 and representative data sources 202, 204, 210, 212, 214 are interconnected via a network connection 220. Although the client nodes 200 are shown separately from the data sources 202, 204, 210, 212, 214, any node connected to the network may function as either or both client node and data source and may act either independently or in concert with other interconnected client nodes and data sources. For example, some or all of the nodes 200 may be any computer. In a typical embodiment, client nodes 200 include components to execute a software program, have one or more input devices such as a keyboard and mouse, one or more output devices such as a display and printer, and the ability to connect to a network such as the world wide web.

The data sources may contain structured or unstructured data. Examples of structured data include data in databases supporting the Structured Query Language (SQL) and repositories in which data is stored in a predefined manner, such as with predictive delimiting keys or tags. Examples of unstructured data sources include repositories containing text in natural language, world wide web pages and data that is subject to unknown or unpredictable change. Examples of natural language include nominally unstructured texts written in English and/or other languages commonly spoken or written. Examples of data subject to unknown changes include world wide web sites containing weather forecasts, where the time and nature of the change is unpredictable and unknown. Examples of semi-structured, unstructured and rapidly changing data sources exist on the world wide web where data is often generated from databases and with a changing visual representation.

A client node 200 employs an identification/extraction specification to perform operations to identify and extract data from the data repositories. A simple example is discussed, to extract pricing information from three URL's containing text, as shown below:

-   http://www.a.com/a.htm: “3 burgers, two fries and a large soda: $4.99”
-   http://www.b.com/b.htm: “Free large soda, 2 fries with three burgers, unbeatable value at $4.87”
-   http://www.c.com/c.htm: “$5.55 gets you large soda, 2 terrific fries and 3 beefy burgers”

The specification includes the location of the data on the page, or information usable to determine the location of the data, and information for extracting the data. In the above example, certain defined keywords are used to locate a section of text of interest: a search is made for the keywords ‘burger’, ‘fries’, ‘soda’ and a numeric price of the format $x.yz. In addition, a search is also made for the presence of a quantity operator that may precede the keyword, such as ‘three burgers’. Keyword comparisons are notorious in the art for being slow and inaccurate. For example, a search on the keyword “burger” would not match the contextually identical term “Big Mac”. It is a common practice to perform what are known as “partial matches”, and such a partial match on the term “burger” would yield a correct match in the sentence “. . . beefy burgers” and on the word “hamburger”, but would also incorrectly match “burgermeister”, “burgerstrasse” and other terms. As used herein, the term keyword refers to a singular or plurality of combinations of terms, identifiers, phrases, sentences, words and such equivalents that are commonly used as a part of Natural Language.
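The keyword, quantity and price search in the burger example can be sketched with regular expressions. This is a hedged illustration, not the specification's method: the function names and the small number-word table are assumptions, and whole-word matching is used here as one possible way to avoid the “burgermeister” false match (at the cost of also missing “hamburger”).

```python
import re

# Hypothetical, deliberately small table of spelled-out quantities.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
PRICE = re.compile(r"\$(\d+)\.(\d{2})")
# Optional quantity, optionally followed by one descriptive word ("3 beefy ...").
QUANTITY = r"(?:\b(\d+|" + "|".join(NUMBER_WORDS) + r")\s+(?:\w+\s+)?)?"


def find_items(text, keywords):
    """Map each keyword found to its preceding quantity (1 when none is given)."""
    lowered = text.lower()
    found = {}
    for keyword in keywords:
        # \b gives whole-word matching; the optional "s" covers simple plurals.
        match = re.search(QUANTITY + r"\b" + re.escape(keyword) + r"s?\b", lowered)
        if match is None:
            continue
        raw = match.group(1)
        if raw is None:
            found[keyword] = 1
        elif raw.isdigit():
            found[keyword] = int(raw)
        else:
            found[keyword] = NUMBER_WORDS[raw]
    return found


def find_price(text):
    """Return the first price of the form $x.yz as a float, or None."""
    match = PRICE.search(text)
    return float(match.group(0)[1:]) if match else None
```

Applied to the three example texts, the same item quantities are recovered from each despite the different word order, which leaves only the price to compare. A contextually identical term such as “Big Mac” would still be missed, which is the limitation of keyword matching that the text describes.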

Data locators may be specified in any of a number of forms, examples of which are shown below:

| Type | Description |
|------|-------------|
| Cartesian | The location of the data is specified as a set of Cartesian co-ordinates describing a rectangle encompassing the text or data of interest. This is generally used only when the data locations are unchanging. E.g.: 10, 5 600, 70 could be used to define all text from line 10, character position 5 to line 600, character position 70. The meaning of the co-ordinates is entirely dependent on the specific embodiment, the nature of the data being extracted and the requirements of the user(s). |
| Block | The location of data of interest is specified as a set of keywords or keyword sequences describing the start and end of a block of text. Thus the sequence “Find my text to identify.” could be identified using the start keyword ‘Find’ and the end keyword ‘.’ (period). This technique can be used to track text blocks when the location of the blocks is not fixed. |
| Offset | An offset in the form of Cartesian co-ordinates, character offset, byte offset, or keyword sequence can be used singularly or in combination to define an offset from a known location or from a previously defined data location. |
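The three locator types can be sketched as small functions over plain text. The function names and signatures are hypothetical, and the Cartesian variant assumes 1-based, inclusive (line, character) co-ordinates read as a start and end position, which is only one of the interpretations the text allows.

```python
import re


def locate_cartesian(lines, start, end):
    """Text from (line, column) `start` to `end`, both 1-based and inclusive."""
    (first_line, first_col), (last_line, last_col) = start, end
    selected = lines[first_line - 1:last_line]  # slicing copies the list
    if not selected:
        return ""
    if first_line == last_line:
        return selected[0][first_col - 1:last_col]
    selected[0] = selected[0][first_col - 1:]
    selected[-1] = selected[-1][:last_col]
    return "\n".join(selected)


def locate_block(text, start_keyword, end_keyword, cursor=0):
    """Return (block, new_cursor) for the first keyword-delimited block at or
    after `cursor`, ignoring capitalization; (None, cursor) if none is found."""
    pattern = re.compile(
        re.escape(start_keyword) + r".*?" + re.escape(end_keyword),
        re.IGNORECASE | re.DOTALL,
    )
    match = pattern.search(text, cursor)
    if match is None:
        return None, cursor
    return match.group(0), match.end()


def locate_offset(text, anchor, offset, length):
    """Return `length` characters starting `offset` characters after `anchor`."""
    begin = anchor + offset
    return text[begin:begin + length]
```

The cursor returned by `locate_block` can serve as the known location for a subsequent `locate_offset` call, which mirrors the way the locator types may be combined.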

Data locators may be specified without regard for alphabetic capitalization of the text and can be combined with Boolean operators (and, or). Each identifier may follow URL's contained in the text block. Each data locator may have a plurality of actions that would be performed in the event that the locator was matched. Well-known text processing techniques, such as regular expressions, Prolog, Lisp, Lex and yacc, may be employed, as well as “adaptive contextual matching” devices and methods as described in greater detail herein.

Describing the locations of data may utilize (or even require) a mixture of the above types, and the nature of the descriptions used may vary between embodiments. An example of how a plurality of data locations are described is shown in FIG. 3, which includes four data items of interest. Items 300 and 302 are links to other pages of interest, to be followed. Items 304 and 310 are text blocks of interest that are to be identified for extraction. Items 306 and 308 contain information that is to be disregarded. The content of Items 304 and 310 and the URL's pointed to by Items 300 and 302 may change between accesses to the page. Item 300 is identified by a data locator looking for the keyword “Next” from the start of the page and the action(s) associated with the keyword are:

-   (a) extract and store the URL as a URL to follow; and
-   (b) set an internal data cursor for the start of the next locator search.

Item 302 is identified by a data locator looking for the keyword “next” from the data cursor and the action(s) associated with the keyword are the same as (a) and (b) above. The keyword “next” in this example may also include any contextually similar term. The boundaries of the block of text 304 cannot be determined by Cartesian geometry, as the text data will change, and thus a data cursor is used to determine the starting point for a search. It is known that the text block starts with a numeric field of a currency type ($55,000) and ends with the keyword MLS1721. Thus, the block start is set to begin with and include a currency type keyword and to end with and include a field of the type “MLSnnnn” where nnnn is an unknown number of numeric characters. The data cursor is set to the end of the MLSnnnn field. Such field delimiting is well known to those skilled in the art. The text block 310 is located in a similar manner to text block 304, as the subjective words “bargain at” in block 310 are to be disregarded. The data identifiers used to describe these data blocks are shown in FIG. 4. The manner in which the text and links are examined is described later with reference to FIG. 7.
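The currency-to-MLS field delimiting with a data cursor can be sketched as follows. The regular expression, the function name and the sample page text are assumptions made for illustration; only the “$55,000” and “MLS1721” field shapes come from the description above.

```python
import re

# A block begins with (and includes) a currency field such as $55,000 and ends
# with (and includes) a field of the form MLSnnnn with any number of digits.
LISTING_BLOCK = re.compile(r"\$\d[\d,]*(?:\.\d{2})?.*?MLS\d+", re.DOTALL)


def next_listing(page_text, cursor=0):
    """Return (block, new_cursor); the cursor moves to the end of the MLS field."""
    match = LISTING_BLOCK.search(page_text, cursor)
    if match is None:
        return None, cursor
    return match.group(0), match.end()


# Hypothetical page content; words before the currency field are disregarded.
page = ("Next | $55,000 three bedroom bungalow, 2 baths MLS1721 | "
        "bargain at $72,500 cottage MLS20 | end")
first, cursor = next_listing(page)
second, cursor = next_listing(page, cursor)
```

Because each search starts at the cursor left by the previous one, the subjective words “bargain at” fall between blocks and are never extracted.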

The keywords are parameters to be included in a Universal Parameter Object (UPO). An embodiment of a UPO is illustrated in FIG. 5. Details of the context appropriate to the parameters may be sourced from a plurality of sources such as a Database Management System (DBMS) 500, the world wide web 502, User Supplied Object 504, Natural Language Processor (NLP) 506, a Graphical User Interface (GUI) 508, a data file 510 containing an existing UPO or containing flat data or other data of expected context, a Browser 512, or even an existing UPO contained in an Adaptive Store 516.

A world wide web page from the world wide web 502 could contain default information or other parameters that could be used by a user without the need to explicitly define such parameters. In one embodiment, a web site (e.g., www.findbase.com) is accessed to retrieve parameters and other information for the UPO, thus assisting an unskilled user who may not be able to easily determine the exact nature and location of the parameters. This technique also allows for a plurality of users to share a set of parameters. Data indicating the context is validated 518 against validation criteria indicated by the context of the original parameters. Validated parameters forming a UPO 520 can be stored in a plurality of persistent object repositories 522 with a unique identification code such as an alphanumeric name or index key.

In some embodiments, the GUI 508 is provided to facilitate correct operation of the UPO. For example, the GUI may include a visual display, a pointing device such as a mouse, an entry device such as a keyboard, local storage for data and program, and a network connection.

The NLP 506 may be employed to decode and extract contextual meaning from text originating as parameter definitions or text extracted from a repository. FIG. 6 shows the components for the Natural Language Processor, which can form part of a UPO 520, part of a GUI 508 or function in an independent manner such as taking textual input from any data source such as a data file, speech, a Palm or other handheld computer. The NLP converts words, commands and parameters contained in Natural Language into a format for storage, comparison and execution by a software program, an example of which could be the embodiment of this invention. FIG. 7 shows an example of commands and parameters contained in Natural Language. It should be noted that the commands and parameters in the Natural Language could be contradictory and conflicting, and such contradictions and conflicts are resolved with a Natural Language Processor (FIG. 6). The meaning of Natural Language Text (NLT) is dependent on the context applied when the text is translated into its component parts. The first example of NLT, 700, shows a sentence in the English language describing some characteristics of a house. Components of the sentence, called tokens, are separated from each other by a delimiter character, which in this instance is a single space or a plurality of spaces. Those versed in the art will recognize that other delimiter characters are used in the English language and the space character could easily be one or a plurality of these characters. The sentence can be immediately recognized as containing a description of a dwelling with a number of bedrooms 702, 704, a type of dwelling 706, a number of bathrooms 708, 710, a numeric range 712, 714 referring to an item 716 and thence another item 718. Contextual meaning is given to the tokens by the individual embodiments such that they would allow for “correct operation” of the parts of the components.

Referring back to FIG. 6, before context and meaning are determined, a text translator 602 is used to provide any initial textual translation such as the removal of presentational formatting. The text translator 602 receives natural language from a plurality of sources such as a data file containing text, a GUI, a UPO (FIG. 5) or a DBMS. The text translator performs natural language translations to convert the text into a form that the text parser and tokenizer (TPT) 606 expects. The TPT 606 splits the text into tokens, each token separated from adjacent tokens by delimiter characters or sequences of characters, known as character strings, contained in a delimiter list 604. Such tokenization is well known to those knowledgeable in the art. Each token is compared with token elements contained in a Token Action Pair List (TAPL) 608 and, if a match is found, the action or actions defined in the corresponding element in the TAPL 608 are performed. Tokens that have no match in the TAPL 608 are ignored or, alternatively, an action or actions are performed on the tokens dependent on the requirements of the embodiment of the invention. The process continues until all tokens have been compared with those contained in the TAPL 608. An example of contradictory and conflicting tokens is shown in 650, the meaning of which is determined by the actions contained in the entries in the TAPL 608. The conflicts are typically resolved when all tokens have been processed by the TPT 606. Example token actions include setting various visual characteristics in a GUI, setting parameters in a file, or any other action that converts the context and meaning of the natural language into a context required by other components of the invention (which are described in other figures).
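The tokenize-and-dispatch flow described above can be sketched as follows. This is an illustration under stated assumptions rather than the TPT 606 or TAPL 608 themselves: the vocabulary, the action functions and the names `tokenize`, `run_tapl` and `TAPL` are hypothetical.

```python
import re

NUMBER_WORDS = {"two": 2, "three": 3}  # hypothetical vocabulary


def tokenize(text, delimiters):
    """Split text on any delimiter string, dropping empty tokens."""
    ordered = sorted(delimiters, key=len, reverse=True)
    pattern = "|".join(re.escape(d) for d in ordered)
    return [token for token in re.split(pattern, text) if token]


def remember_number(state, token):
    state["pending"] = NUMBER_WORDS[token]


def assign_pending(field):
    def action(state, token):
        # Attach the most recently seen number to the named field.
        state[field] = state.pop("pending", None)
    return action


def set_dwelling(state, token):
    state["dwelling"] = token


# Token/action pairs: each matching token triggers its action.
TAPL = {
    "two": remember_number,
    "three": remember_number,
    "bedroom": assign_pending("bedrooms"),
    "bathroom": assign_pending("bathrooms"),
    "ranch": set_dwelling,
}


def run_tapl(text, delimiters, tapl):
    state = {}
    for token in tokenize(text.lower(), delimiters):
        action = tapl.get(token)
        if action is not None:  # unmatched tokens are ignored in this sketch
            action(state, token)
    return state


result = run_tapl("Three bedroom ranch, two bathroom", [" ", ","], TAPL)
```

In this sketch a later token simply overwrites the state set by an earlier one, which is one simple way that conflicting tokens could be resolved once all tokens have been processed.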

With reference to FIG. 8, TAPL comparisons can also use Word Identifiers that indicate the meaning of the word and any actions that are to be taken relating to said word. Item 800 is a Word Identifier comprised of a number of categories 810, 820, 830 and Word Classification Codes 840. Although three categories are shown, there is no theoretical limit on the number and types of categories, and the requirements of specific embodiments should be considered. For example, the category “synonym” 810 is a reference to an Adaptive List of Synonyms 850 for the word 800. The use of an Adaptive List (FIG. 19) enables the synonyms contained or referenced in 850 to be relevant to the needs of the specific embodiment by making more prominent (or even only keeping) those synonyms that are frequently used. The number of possible synonyms can be large, and synonyms can be duplicated and can reference yet more synonyms. The Adaptive Lists 850, 860, 870, 89, 894 make more prominent those elements relevant to the context of the word defined by Word Identifier 800. Other categories are also shown: a reference to a list of parents 850 and a reference to a list of relations 870. Although Adaptive Lists may be used in some embodiments, other embodiments utilize other forms of lists such as an SQL or hash table. The Classification Codes 880 indicate the meaning and knowledge that is encapsulated in the word. The “expecting type” 882, “expecting words” 884 and “Relations” 870 lists, when used singularly or in combination, provide contextual information for Associative Comparisons (FIG. 12).

With reference to FIG. 9, it is seen that the first three words of the example sentence 700 (FIG. 7) are contained by Word Identifiers 900, 912 and 924. An adaptive list 936 defines some of the synonyms for the word “three” contained in 900. The number or nature of the synonyms should not be considered limited to those shown in 936, 938 and 940. The other references 904, 906, 914, 916, 918, 926, 928, 930 are defined in accordance with the requirements of the specific embodiment. There are very few words that are not related to other words and, thus, references such as 904, 906, 914, 916, 918, 926, 928, 930 can become very large.

With reference to FIG. 10, it is seen how these references can spread. For example, the word “love” is considered as a “head word” in so far as no other words are defined that reference it. Three synonyms for “love” are shown 1010, 1020, 1030, and these in turn point to other words or lists of words. The word “devotion” 1010, having parent “love” 1000, is considered the “head word” for the words “devotedness” 1040 and “devout” 1060, which in turn acts as the “head word” for the word “religion” in list 1078. A “head word” can be considered as the starting point for lists of related words, and each head word can be referenced by a plurality of other head words or keywords. Clearly, the number and type of the relationships is limited only by the needs of the specific embodiment. The relationship between these words is defined or learned as shown, for example, in FIGS. 12 and 13.

FIG. 11 illustrates an example series of classification codes as applied to the word “cat”; these codes may vary between embodiments. FIG. 11 supplies meaning to the word “cat” as a series of numbers representing general classifications or groupings that have relevance to the specific embodiment. In the example, a cat is encompassed by categories 1110, 1112, 1114, 1116, 1118, 1120, 1122 defining a cat as a Mammal (code 1002), a female (code 1), of species “Felis Catus” (code 1223), a carnivore (code 1000), with a function “sleeps” (code 5002). Although other categories 1120, 1122 are not specified, the number and type of these categories is theoretically not limited.

The category codes in FIG. 11 provide a numerical representation of the meaning of the word that facilitates fast comparisons with other words utilizing similar and contextually compatible Classifications. In this way, words with similar classifications can be considered contextually similar; words with identical classification codes can be considered contextually equivalent. The term “similar” refers to the difference between the contextually equivalent classification codes in the words being compared. Such numerical comparisons provide the means to show that a “cat” is not a “tulip” and also that a “cat” is very close to a “lion”. Embodiments using the NLP 506 can identify contextual contradictions such as “a cat is a tulip” and “lions eat tulips” while recognizing contextually correct similarities such as “cats are lions”. Quantifying the level of these similarities with existing techniques such as keyword matching, word comparisons or regular expressions is extraordinarily difficult, unwieldy or even impossible.

With reference to FIG. 12, it is seen that 1210 shows that the numerical difference 1216 between the words “cat” 1212 and “tulip” 1214 is large. Although the meaning of the difference varies between embodiments, the size of the difference 1210 indicates a high probability that a “cat” 1212 is not a “tulip” 1214 and a correspondingly low probability that a “cat” is a “tulip.” Reference to the comparison 1226 shows the difference 1224 between the words “cat” 1220 and “lion” 1222. In this example, the difference is small, indicating a high probability that a “cat” 1220 is in some way related or connected to a “lion” 1222. The term function 1230 defines the difference equation for category c0 in Items I0 and I1. The term function 1232 defines the difference equation for category c1 in Items I0 and I1, and the term function 1234 defines the difference equation for category cn in Items I0 and I1. The number of terms is dependent on the number of classification codes, which may vary between contextually similar words and specific embodiments. The term 1236 defines the proximity between classification codes Ic0 and Ic1. The term 1238 defines the proximity between the classification codes cx to cn where Δ_(cx) and Δ_(cn) are a set of cells 1238.
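A numerical comparison of this kind can be sketched as below. The difference equations of FIG. 12 are not reproduced in the text, so this sketch assumes a simple sum of absolute per-category differences with a fixed penalty for categories present in only one word. The “cat” codes are those given with reference to FIG. 11; the “lion” and “tulip” codes, the category names and the penalty value are hypothetical.

```python
def classification_difference(item0, item1, missing_penalty=10_000):
    """Sum per-category differences between two sets of classification codes."""
    total = 0
    for category in item0.keys() | item1.keys():
        if category in item0 and category in item1:
            total += abs(item0[category] - item1[category])
        else:
            # Assumed handling: a category missing from one word counts as far apart.
            total += missing_penalty
    return total


# "cat" codes follow the FIG. 11 example; the other two are invented for illustration.
CAT = {"class": 1002, "sex": 1, "species": 1223, "diet": 1000, "function": 5002}
LION = {"class": 1002, "sex": 1, "species": 1224, "diet": 1000, "function": 5003}
TULIP = {"class": 3001, "sex": 0, "species": 7410, "diet": 4000, "function": 9000}

cat_vs_lion = classification_difference(CAT, LION)
cat_vs_tulip = classification_difference(CAT, TULIP)
```

With these assumed codes the cat/lion difference is small and the cat/tulip difference is large, which mirrors the comparisons 1226 and 1210. How a difference is converted into a probability is left to the embodiment.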

With reference to FIG. 13, it is seen that contextual meaning is given to Word Classifications 1300 in the form of “expecting type” 1318, “expecting words” 1320 and “trigger” 1322 that, when used singularly or in combination, can develop adaptive lists of words of expected classifications 1332, 1334, 1336 and adaptive lists of probable replies 1338, 1340, 1342. Item 1324 shows a series of example questions.

In this example, a UPO 520 (FIG. 5) is constructed with parameters such as are required to identify and extract information for the question 1326 “Who is the President?”. In this example, Word Identifiers (FIG. 8) and respective Word Classifications (FIG. 11) are set to contain “the United States of America”, giving contextual information. Word Classification comparison of the extracted data shows a very high probability that “George W. Bush” is the correct answer. The way that this reply is presented varies among embodiments. In accordance with the embodiment, values of the difference terms 1230, 1232, 1234, 1236, 1238 are used singularly or in combination to provide an indication of the accuracy of the result. The Word Identifiers (FIG. 8) and respective Word Classifications (FIG. 11) for the question Q1 (1326) are stored in an Adaptive List 1332.

The Word Identifiers (FIG. 8) and their respective Word Classifications (FIG. 11) for the answer A1 (1326) are stored in an Adaptive List 1338. Question Q2 (1328) now changes the context contained in 1332 and 1338 from the United States of America. A UPO (FIG. 5) is constructed with parameters such as are required to identify and extract information for the question 1328 “What does FINDbase do?” The Word Identifiers (FIG. 8) and their respective Word Classifications (FIG. 11) for the answer A1 (1328) are stored in an Adaptive List 1340. The question “Who is the President” 1330 now has contextual relevancy encompassed in 1334 and 1340 such that a Word Classification comparison of the extracted data shows a very high probability that “Ian R. Nandhra” is the correct answer. The adaptive lists 1332, 1338, 1334, 1340, 1336, 1342 gather Word Identifiers encompassing actual answers, probable answers and other information that matches the context and meaning of the questions and the answers supplied, with the least probable information (in this instance Word Identifiers) being dropped from the List (this is described in greater detail later with reference to FIG. 20).

The use of Adaptive Lists results in faster accesses to the most relevant information. FIG. 14 shows another aspect of this invention, referred to as “Adaptive Stores”, in contrast with other types of storage familiar to those versed in the art. A conventional volatile store lacks the ability to retain information stored therein for periods of time without power. A persistent store has the ability to retain information stored therein for periods of time with or without power. Volatile stores are typically very much faster than persistent stores and are preferred for rapid retrieval of small amounts of information. Persistent stores are typically very much slower than volatile stores but are preferred for storing large amounts of information. Persistent and volatile stores used in the art require that the absolute location of a data item be known prior to its retrieval from the store. Finding an item of data in the store without knowledge of its absolute location requires that the store be searched from an initial starting position until the item is found. The time for such a search is dependent on such factors as the access speed of the store and the amount of data in the store. The closer the object of a search is to the initial starting position, the quicker it will be found. Storage devices typical of those used in the art also lack the ability to group similar items in close proximity in the store to enable faster discovery during a search. An “Adaptive Store” as referred to herein includes functionality that groups the most currently used data in the store (perhaps keeping only the most currently used data, to the exclusion of other data), thus greatly speeding up data searches and reducing the amount of irrelevant data in the store.

The Adaptive Store 1404 may include any combination of volatile stores 1400 and persistent stores 1402. These are special types of stores, each termed a “cell”, reflecting the smaller amount of data and the fact that the “cell” includes trigger mechanisms 1410, 1422 not found in conventional stores. In this example, the volatile cell 1410 is comprised of computer memory such as random access memory or any volatile memory device. The persistent cell 1420 in this example is comprised of an SQL 1412, NVRAM 1414 and a database 1416, in addition to the event trigger mechanism 1418.

Example event triggers 1422 access a singular or plurality of devices such as computer programs when certain conditions have occurred in the store. The conditions shown in 1422 vary among embodiments and are not restricted to these examples. Each trigger 1422 has actions associated with it commensurate with the requirements of the specific embodiment. For example, such actions could be used to load data into other Adaptive Stores, load data, configure a UPO, etc. Actions associated with trigger 1424 are performed whenever an element is read from the store. Actions associated with trigger 1426 are performed whenever an element is written to the store. Actions associated with trigger 1428 are performed whenever an element is searched for in the store. Actions associated with trigger 1430 are performed whenever an element is promoted in the store. Actions associated with trigger 1432 are performed whenever an element is demoted in the store. Actions associated with trigger 1434 are performed whenever an element is dropped from the store. Actions associated with trigger 1436 are performed whenever an element is added to the store. Actions associated with trigger 1438 are performed whenever an element is inserted into the store. These triggers form the interactions between Adaptive Stores and Adaptive Lists as described elsewhere in this patent application.
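One way such event triggers could be arranged is sketched below: a store that calls registered actions when an element is read or written. The class name `TriggeredStore`, its methods and the event names are assumptions made for illustration; only two of the eight trigger types are wired up to keep the sketch short.

```python
from collections import defaultdict


class TriggeredStore:
    """Hypothetical cell that fires registered actions on store events."""

    EVENTS = {"read", "write", "search", "promote", "demote", "drop", "add", "insert"}

    def __init__(self):
        self._items = {}
        self._actions = defaultdict(list)

    def on(self, event, action):
        """Associate an action (a callable) with one of the trigger types."""
        if event not in self.EVENTS:
            raise ValueError(f"unknown event: {event}")
        self._actions[event].append(action)

    def _fire(self, event, key):
        for action in self._actions[event]:
            action(event, key)

    def write(self, key, value):
        self._items[key] = value
        self._fire("write", key)

    def read(self, key):
        value = self._items[key]
        self._fire("read", key)
        return value


log = []
store = TriggeredStore()
store.on("read", lambda event, key: log.append((event, key)))
store.write("elementA", 1)
value = store.read("elementA")
```

An action registered in this way could, for example, load related data into another store, which is the kind of interaction between Adaptive Stores and Adaptive Lists that the triggers are described as providing.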

FIG. 15 illustrates interconnections of volatile and persistent Adaptive Storage cells across a network and residing in the same system. A common interface 1604 (see FIG. 16, discussed later) allows remote and local cells to be accessed in the same way. Cells 1504, 1510, 1520 and 1526 are connected to other cells either locally or remotely, the exact nature of the interconnections being dynamically changeable and dependent on the requirements of the specific embodiments. In the FIG. 15 example, cell 1504 can access data in cell 1522 via the remote interface 1508. An Application Programming Interface (API) as illustrated in FIG. 16 addresses the incompatibilities and inconsistencies between various storage devices. Thus the Random Access Memory 1606, the SQL 1608, the Adaptive Store 1614 and the Remote Adaptive Store 1616 can all be accessed in the same way from local 1600 and remote (i.e., networked) 1602 locations.

FIG. 17 illustrates an Adaptive Store including an Index 1706 and an Adaptive List 1710 referencing data objects 1700, 1702, 1704 and 1708. The Index 1706 facilitates fast look-up access to specific elements without the need to traverse the Adaptive List 1710. Such indexing is useful when the exact nature of the data element being referenced is known. For example, hash tables are a fast index available for use with data items whose nature is exactly known, and allow objectN 1708 to be located without having to search through all the elements A through N in the List 1710.

If the exact nature of the data is not known or an associative match is required, the Adaptive List 1710 is searched using the Classification Comparisons as described earlier with reference to FIG. 12. Although hash tables can be sequentially accessed (allowing Classification Comparisons on each element), hash tables do not provide for access to stored data objects in a consistent or uniform manner. For example, searching all the elements in a hash table using conventional techniques may return the elements in the order CEDAHFGNB, whereas the next time the hash table is searched the elements could be returned in a different order, BACEFHNGD. As can be seen from previous examples, searching for a close match to an unknown data item or searching for an item when its exact location in the store is unknown is faster if the most frequently found data items are closer to the starting location of the search.

Although hash tables and other index devices are fast, they do not provide for arranging data items in a particular order. Indexes can typically increase the amount of storage space required for the data item and also increase the time taken for data items to be added, inserted and removed from the Adaptive Store. FIG. 18 illustrates an Adaptive Store without any indexing mechanism, as may be particularly useful for embodiments having limited storage capacity.

With reference to FIG. 19, a list component of an adaptive store is shown in an initial state 1900. Notice is drawn to the position of the data items in the list, elements A through N corresponding to list locations 0 through n, and with particular reference to list location 2 of the initial state 1900 containing elementC and list location 3 of the initial state 1900 containing elementD. The state of the list elements is shown in state 1902 after a first read access has been made to elementD in list location 3 of initial state 1900. Attention is drawn to the position of elementD, which has been promoted in state 1902 one element up the list, while elementC has been demoted in state 1902 one element. Subsequent accesses to elementD can be seen in states 1904 and 1906 to result in elementD being promoted one element toward the start of the list each time until elementD is the first element in the list 1906. Access to elementG in state 1908 shows the promotion of elementG towards the start of the list and access to elementH in state 1910 shows the promotion of elementH. In this way, the most frequently accessed elements are moved to the front of the list, facilitating faster comparative searches.

The addition of a new elementZ at state 1912 results in the least used list member, elementN, being replaced by elementZ. The manipulation of the data elements in Adaptive Lists is not limited to that specifically shown in this example. Other manipulations, such as deletion of elements in the list, insertion of elements into the list, and sorting, are also employed in some embodiments.
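The promote-on-access behaviour of FIG. 19 can be sketched with a plain list, as below. This is a minimal illustration, not the list component itself: the class name `AdaptiveList` and its methods are hypothetical, and the sketch assumes that the element at the end of the list is the least used one.

```python
class AdaptiveList:
    """Hypothetical list that promotes an element one place on each access."""

    def __init__(self, items, capacity=None):
        self.items = list(items)
        self.capacity = capacity if capacity is not None else len(self.items)

    def access(self, item):
        """Read an element and swap it with its predecessor (promotion)."""
        index = self.items.index(item)
        if index > 0:
            # The accessed element moves up; its neighbour is demoted one place.
            self.items[index - 1], self.items[index] = (
                self.items[index],
                self.items[index - 1],
            )
        return item

    def add(self, item):
        """Add an element, replacing the last (assumed least used) one when full."""
        if len(self.items) >= self.capacity:
            self.items[-1] = item
        else:
            self.items.append(item)


elements = AdaptiveList(["A", "B", "C", "D", "E", "F", "G", "H"])
elements.access("D")
after_first_access = list(elements.items)
elements.access("D")
elements.access("D")
elements.add("Z")
```

After the first access, elementD and elementC have exchanged places; after two further accesses elementD is at the start of the list, and adding elementZ replaces the element in the last position.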

Furthermore, adaptive lists can assign priority values to specific elements or sets of elements as shown with reference to FIGS. 20 and 21. Prioritizing a list element increases its probability of remaining in the list or reaching the top of the list. The use of priority values helps to ensure that prioritized elements remain in the list or, on the contrary, that an element will be quickly dropped from a list.

Item 2000 shows the initial state of an adaptive list prior to elementE being accessed. Item 2002 shows elementE's promotion, and Item 2004 shows elementE and elementC having the same priority value and, finally, elementE being promoted with a higher priority value than elementC. An example algorithm is shown in FIG. 20-1. Specific attention is drawn to the other priority values, which are of varying values resulting from, for example, other elements being inserted and removed. Such elements may have been inserted from other adaptive stores in other systems on a network, for example. Weighted list values facilitate sharing Adaptive List data and alter the priority of the elements in a list across adaptive stores.
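Priority-based promotion can be sketched as follows. The algorithm of FIG. 20-1 is not reproduced in the text, so this sketch assumes one plausible rule: each access raises an element's priority value, and the element moves ahead of a neighbour only when its priority is strictly higher. The function name, the starting priorities and the boost of 1 are hypothetical.

```python
def access_with_priority(entries, name, boost=1):
    """Raise the priority of `name` and promote it past lower-priority entries.

    `entries` is a list of [name, priority] pairs, highest priority first.
    """
    index = next(i for i, entry in enumerate(entries) if entry[0] == name)
    entries[index][1] += boost
    # Promote only while the priority is strictly higher than the predecessor's.
    while index > 0 and entries[index][1] > entries[index - 1][1]:
        entries[index - 1], entries[index] = entries[index], entries[index - 1]
        index -= 1
    return entries


entries = [["A", 5], ["B", 4], ["C", 3], ["D", 2], ["E", 2]]
access_with_priority(entries, "E")
order_after_first = [name for name, _ in entries]
access_with_priority(entries, "E")
order_after_second = [name for name, _ in entries]
```

Under these assumptions the first access leaves elementE directly behind elementC with an equal priority value, and the second access promotes elementE ahead of elementC.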

Another example of element prioritization is the adaptive migration of elements between adaptive stores interconnected on a network. Another example is the pre-loading of Word Identifier lists based on the priority of a head word (FIG. 10) in an adaptive list. The use of event triggers 1410 (FIG. 14) facilitates contextual interaction with a singular or a plurality of other lists within the same embodiment or a plurality of interconnected embodiments. The example described is just one embodiment of list prioritization, and the particular method of calculating and utilizing the weighting values differs among embodiments. For example, some embodiments may use a weighting value to sink elements to the bottom of the list rather than promoting them to the top. This “negative bias” produces lists containing the least frequently used elements. Negative bias lists are, for example, used by embodiments to eliminate least used elements from a singular or a plurality of lists. The data remaining in these lists has by definition a higher relevancy than if the list still contained known little-used elements. Embodiments using a combination of weighting and prioritized lists can significantly reduce the time taken to perform Word Comparisons (FIG. W7), as the amount of data requiring comparison is greatly reduced. It can be seen from FIG. 13 in particular that adaptive lists for probable classification 1332, 1334, 1336 and probable replies 1338, 1340, 1342 may be built using triggers associated with the addition, insertion, deletion, promotion and demotion of elements in Adaptive Lists.

Now, extraction of data is discussed. To extract data of interest from data sources, the client node uses a knowledge of the nature of the data to be extracted, where to extract the data from and how to extract the data. The location of data in data repositories is often subject to change, a particularly good example being the world wide web. To locate data of interest, it is useful to have knowledge of the initial location of the data, how to recognize that the data has moved and how to identify the new location. For example, data repositories on the world wide web typically employ a series of URL's in a chain that eventually locate a data item as shown in FIG. 21. URL 2100 is for an archive site with a page containing links to five files, file01, file02, file03, file04 and file05, spanning URL's 2102, 2104, 2106, 2108, 2110, and 2112. Since these files are spread over a number of different URL locations and on different repositories, a mechanism is employed to filter out links to sites that are not of interest. This includes rating the links in a page as “of interest” and “of no interest”. Links “of no interest” are not followed. Links “of interest” are followed to a specified depth. By applying data locators to pages pointed to by links “of interest”, data is tracked or followed across a plurality of sites. The depth parameter defines the number of links to be followed before data of interest is found as a result of matching data locators.
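The link-following behaviour described above can be sketched as a depth-limited traversal. The sketch below works on an in-memory table of links rather than on the world wide web, and every name in it (`follow_links`, `of_interest`, `has_data`, the page names) is an assumption made for illustration.

```python
def follow_links(start, links, of_interest, has_data, max_depth):
    """Follow links rated "of interest" up to max_depth and collect data pages.

    `links` maps a page to the pages it links to; `of_interest` rates a link;
    `has_data` stands in for a data locator matching on a page.
    """
    found = []
    seen = {start}
    frontier = [start]
    for _ in range(max_depth + 1):
        next_frontier = []
        for page in frontier:
            if has_data(page):
                found.append(page)
            for target in links.get(page, []):
                # Links "of no interest" are never followed.
                if target not in seen and of_interest(target):
                    seen.add(target)
                    next_frontier.append(target)
        frontier = next_frontier
    return found


LINKS = {
    "archive": ["mirror1", "ads"],
    "mirror1": ["file01", "file02"],
    "ads": ["file99"],
}

found_depth_2 = follow_links(
    "archive", LINKS, lambda page: page != "ads",
    lambda page: page.startswith("file"), max_depth=2,
)
found_depth_1 = follow_links(
    "archive", LINKS, lambda page: page != "ads",
    lambda page: page.startswith("file"), max_depth=1,
)
```

With a depth of 2 the two files behind the mirror are reached, while the file behind the link rated “of no interest” is never visited; with a depth of 1 no data is found.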

In one embodiment, the knowledge of the data discussed above is encapsulated into a Universal Data Locator (UDL) in the form of a series of parameters that, once combined, form a Universal Data Locator Object (UDIO). With reference to FIG. 22, the parameters for creating a UDIO 2210 originate from item 2200 that is an embodiment of a parameter source as described with reference to FIG. 5. In embodiments where the data location is known, or where the data is not subject to unpredictable changes in location, or when the nature of the changes is otherwise described by a parameter source, these parameters are sometimes combined with the data locators 2206 and the extraction method 2208 to form a UDIO 2210. The location of data in pages on the world wide web and the location of the page on the world wide web are both subject to unpredictable change, requiring further steps in the construction of a UDIO 2210.

FIG. 23 illustrates operations for locating and tracking such changing data. A UPO 2300 describes a world wide web Uniform Resource Locator (URL) from which the page pointed to by the URL is downloaded 2306. Alternatively, if a URL is not provided, a UPO 2306 provides a WWW page source in a form expected by the data locator 2308. The data locator uses parameters provided by a UPO 2304 to determine the location of the data of interest on the page source provided by 2306 or 2302 and formats the location determination into a form usable by 2314. Keys identifying the data are generated at 2314 using parameters from UPO 2316. In the event that the data is split across several URL's or data links, the locations and definitions of such URL's and links are determined by 2312 using parameters from UPO 2310, 2324. The particular operations of 2306, 2308, 2312 and 2314 are typically dependent on the nature of the specific data of interest, which varies among different embodiments. For example, an embodiment may use regular expressions, a commonly used programming technique well known to those skilled in the art, to identify data and the location of data. Embodiments may use a singular or a plurality of Adaptive Lists. Example location information may be the number of bytes from a particular known location UDI 2204 such as the start of the data source. Another example is a number of bytes contained between two identification keys 2314. Another alternative is a data search for strings or characters or an adaptive search or other data in pages until a match is found. Another example is an extraction method embodied in a software executable or source code compatible with and capable of being executed in a manner facilitating the correct extraction of the data of interest. In some embodiments, the user may control the nature of these searches from parameters from a UPO. In other embodiments, the user assigns relevancy to the contextual meanings to determine the contextual accuracy. Such relevancy may be provided from a GUI, speech recognition, email or other device. This relevancy forms part of the weighting, where positive values indicate increasing relevance and negative values indicate decreasing relevance. Embodiments such as those employing Word Classification Differences would give relevancy to the difference between the contextual Word and the Word Classification code being compared.

Accuracy of a data item can be defined as the measure of difference between the data item and a single reference data item or plurality of data items. In a situation where data is being discovered during, for example, a search operation on repositories on the World Wide Web, the reference item or items are being determined from the discovered data. Clearly, the more data being examined, the more accurate the references will become. In an example comparison, discovered data is first decomposed into a plurality of meaning definitions that are stored in association with the discovered data in an adaptive store. The priority value that the data object takes in the store (FIG. 20) is proportional to the comparison difference (FIG. 12) when the meaning is compared with other data. In this way, the adaptive store contains the most relevant data item at the start of the store (or is otherwise made more prominent). For example, an adaptive store termed “Reference Store” would contain discovered data in association with the meaning of the said discovered data that could then be used as a reference against which other data could be compared. The comparative difference between the compared data and the reference data could be used to influence the priority values (FIG. 20) in the store, thus providing a hierarchy of contextually relevant reference data.

The data location 2204 or URL location 2202, data item selection criteria 2206 and extraction method 2208 are combined into a Universal Data Identifier Object 2210 that may be stored in a “persistent object store” or on media or in another storage mechanism in a compressed or uncompressed form. Persistent object stores are used for temporary or permanent storage of data, computer executable programs and computer source code, the repository of which can reside independently or on any node in a network.

Referring to FIG. 24, a UDIO 2400 provides a description of data to be extracted to an extraction engine 2402 that uses the data descriptor information in the UDIO 2400 to locate and extract the described data into a persistent storage repository, taking supplemental parameters from UPO 2410. Data normalization 2404 is performed using parameters in the UDIO 2400 and those in UPO 2406 to convert the extracted data into a context expected by the Analysis Engine 2408. Using parameters from UPO 2410, the analysis engine 2408 takes normalized data from 2404 and performs such operations and comparisons as are specified in UDIO 2400 and UPO 2410. Analysis output from 2408 may be stored in a repository before being directed to the results processor 2412. The extraction engine 2402 operates on the parameters contained in the UDIO 2400. For example, an entire site may be downloaded by setting the UDIO parameters to a URL and setting parameters to download every page indicated by every URL reference in the page. In another example, all pages containing text contextually matching those Word Classifications in an Adaptive Store may be downloaded from a plurality of sites.

It is common practice for URL references in a world wide web page to point to other world wide web sites, which could result in all the pages in the entire world wide web being downloaded. To avoid, for example, downloading (or attempting to download) the entire world wide web, some embodiments include exclusion parameters in the UDIO and UPOs 2406, 2410 to control which URL's the extraction engine 2402 may follow. Such exclusion parameters may exclude a plurality of domains, node and data repository locations, and specify a maximum depth into a repository that the extraction engine 2402 will follow data types pointed to by a URL. In other embodiments, the exclusion parameters and URL traversals are performed in other parts of the system such as data normalization 2404. For example, these exclusion parameters allow an embodiment to extract all the email addresses from specified repositories. Another example embodiment uses these exclusion parameters to only download files of a particular type such as music or pictures. Another example embodiment uses the extraction parameters to follow price information across many repositories. Another example embodiment uses a mixture of extraction and exclusion parameters in Adaptive Lists to find contextually similar phrases and text in a plurality of sites, where each site is traversed in accordance with the results of Word Classification Comparisons between text discovered in the sites and that supplied by a user or from a UPO.
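Exclusion parameters of the kind described above can be sketched as a single predicate that the extraction engine consults before following a URL. The function name `may_follow`, its parameters and the example domains are hypothetical; the sketch uses only the Python standard library function `urllib.parse.urlparse`.

```python
from urllib.parse import urlparse


def may_follow(url, depth, excluded_domains=(), allowed_suffixes=None, max_depth=3):
    """Decide whether a URL passes the (assumed) exclusion parameters."""
    if depth > max_depth:
        return False
    parsed = urlparse(url)
    host = parsed.hostname or ""
    # Exclude a listed domain and any of its subdomains.
    if any(host == d or host.endswith("." + d) for d in excluded_domains):
        return False
    # Optionally restrict downloads to particular file types.
    if allowed_suffixes is not None:
        if not parsed.path.lower().endswith(tuple(allowed_suffixes)):
            return False
    return True


music_only = [".mp3"]
blocked = ["ads.example.net"]

ok = may_follow("http://example.com/songs/a.mp3", 1, blocked, music_only)
wrong_type = may_follow("http://example.com/pics/a.jpg", 1, blocked, music_only)
excluded = may_follow("http://x.ads.example.net/a.mp3", 1, blocked, music_only)
too_deep = may_follow("http://example.com/songs/a.mp3", 4, blocked, music_only)
```

Keeping the decision in one predicate means the same parameters can be applied wherever URL traversal happens to be performed in a given embodiment.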

The extraction engine is shown in more detail with reference to FIG. 25. A UDIO 2500 describes parameters of data locations to be extracted that form the initial entries 2502 in list 2504. The data to be extracted is extracted from the data repository with the extraction engine 2506 before being stored in a persistent store 2510 and forming the input to the URL list manager 2508. The data in the persistent storage 2510 is normalized 2514 to convert it into a context which can be more easily used by other elements, such as the analysis engine 2408. In some embodiments, such normalization employs Word Classifications in Adaptive Stores to reconstruct contextually compatible formats that are usable by other elements such as the analysis engine 2408. This normalization addresses the problems of inconsistent and incompatible words, phrases and other data in the persistent store 2510.

With reference to FIG. 26, URL's describing data to download are removed from the URL list 2604 in accordance with parameters supplied by UPO 2600 and the UDIO 2606. Example parameters 2614 determine the way in which data is extracted from the URL. Using the world wide web as an example, world wide web servers typically maintain information on the number of times the server has been accessed, each access being referred to as a hit. Information pertaining to each hit is typically recorded by the world wide web server for later analysis to determine, for example, the origin of the hit and what area of the world wide web server was accessed by the hit. World wide web servers have limits on the number of hits for which they can concurrently supply data, and embodiments of this invention could easily exceed the number of hits that a world wide web server could support. Additionally, world wide web servers typically record the type of browser that was used to access the data on the server. In accordance with some embodiments, there is control over the way that the hits appear on the world wide web server, so that they may appear as a particular type of browser and/or as a human operator access, a practice known in the art as spoofing.

The combination of time interval, page sequence and browser type information that the world wide web server records about the access is termed a hit signature. Element 2614 shows various parameters 2614-4, 2614-5, 2614-6, 2614-7, 2614-8 and 2614-9 controlling the periodicity of hits on the world wide web server. Analysis of hits and hit signatures on a world wide web server can provide information about the origin of the hit. For example, parameters 2614-1, 2614-2, 2614-3 define the order of the URL's accessed on the world wide web server and these, in combination with 2614-4, 2614-5, 2614-6, 2614-7, 2614-8, 2614-9, can make accurate analysis of world wide web hit signatures almost impossible. Hit signature analysis is especially useful for detecting spoofing accesses. For example, a human's speed of access to URL's is limited by the speed at which the browser being used can display the page. The display of a page can often take several (often tens of) seconds if the server being accessed is using the widespread and popular practice of displaying banner advertisements before displaying the rest of the page. Thus any access faster than, for example, 500 ms could be considered to be from an automated mechanism. Furthermore, concurrent or parallel access to the server is limited by the speed of the computer hardware, network and the ability of a human to select URL's on a web page within the previously discussed 500 ms. Thus, there is a relatively high level of probability that hits separated by less than 100 ms are from an automated mechanism, as it is virtually impossible for a human to access links faster than this due to being limited by browser refresh speed and the human ability to visually and physically respond to displayed information. However, it is difficult to determine that a slow hit frequency is from an automated mechanism and not from a human, without some other indicia (e.g., the mechanism accesses the robots.txt file which servers often provide as a control mechanism when the server is indexed by a WWW search engine). The robots.txt file is not normally accessed by a human using a web browser, but such access could spoof the server into considering that the human access is in fact an automated mechanism.

The determination of which URL to download is performed at 2602 using the parameters 2614 from the UPO 2600 and UDIO 2606 as previously described. In one embodiment, removal of URL's is arbitrated from the start and end of the list, waiting for a random period of time between each removal. In other embodiments, elements are sequentially removed from the start of the list with no time interval delay. Combining time intervals with random selection of URL's can emulate the way in which a human would access information on a web site, whereas a fast sequential access would enable analysis of world wide web server hits to determine that a machine is making the hits.
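The removal scheme in which URL's are taken from either end of the list with a random wait between removals can be sketched as follows. This is an illustrative sketch only: the function name plan_removals, the equal chance of choosing either end and the uniform delay distribution are assumptions, and the code only plans the order and delays without performing any download.

```python
import random
from collections import deque


def plan_removals(urls, t_min, t_max, seed=None):
    """Plan the order in which URLs leave the list and the wait before each.

    Assumptions: each removal takes the first or last entry with equal
    chance, and the wait is drawn uniformly from [t_min, t_max] seconds.
    """
    rng = random.Random(seed)
    pending = deque(urls)
    plan = []
    while pending:
        # Arbitrate between the start and the end of the list.
        url = pending.popleft() if rng.random() < 0.5 else pending.pop()
        delay = rng.uniform(t_min, t_max)
        plan.append((url, delay))
    return plan


pages = ["home.htm", "bedrooms.htm", "kitchens.htm", "gardens.htm"]
for url, delay in plan_removals(pages, t_min=2.0, t_max=30.0, seed=1):
    print(f"wait {delay:5.1f} s, then fetch {url}")
```

Setting both t_min and t_max to zero and always removing from the start of the list corresponds to the fast sequential embodiment, which yields the easily recognized machine signature described above.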

In some embodiments, use is made of both time interval and random URL selections, as described in element 2514. Emulating or spoofing human access behavior is further enhanced by using measured values for at least some of the parameters 2614. Such parameters can be obtained by using a browser to measure tmin, tmax and tav access times for specific pages on the web site to be accessed on a range of internet connections providing for representative internet access speeds. Internet access speeds and latencies typically vary between periods of the day and the days in the year. In accordance with some embodiments, parameter values for 2614 are taken from averages over a period of time and under different conditions.

Once identified, the URL is removed from the URL list 2604 by 2608 and passed to a download thread 1010. Use may be made of thread pools 2612, which are known in the art. The URL data download is performed by 2616 and converted into a form usable by 2510 and 2508. The persistent store 2508 records accessed URL's which are POO.

Turning now to FIG. 27, the URL list manager 2708 components are described. The downloaded URL 2708 and 2506 is analyzed 2702 for the presence of any further URL's, which are extracted into a temporary list 2704. Each element in list 2704 is validated against parameters from UPO 2706 and UDIO 2710, and valid URL's are added to the URL list 2704 by 2712 in accordance with parameters from UPO 2706 and UDIO 2710. For example, URL's that are outside the scope of a plurality of repositories on a network may be rejected. Alternatively, URL's that do not contain a certain plurality of data or data sets may be rejected.
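The list manager steps of finding further URL's in a downloaded page, validating them and adding the valid ones to the URL list can be sketched as below. The sketch assumes HTML input and a validation rule that accepts only hosts in an allowed set; the names LinkCollector and new_urls and the example hosts are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attributes of anchor tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative references against the page URL.
                    self.links.append(urljoin(self.base_url, value))


def new_urls(page_url, html, allowed_hosts, already_listed):
    """Return in-scope URLs found in `html` that are not yet on the list."""
    collector = LinkCollector(page_url)
    collector.feed(html)
    accepted = []
    for link in collector.links:
        in_scope = urlparse(link).hostname in allowed_hosts
        if in_scope and link not in already_listed and link not in accepted:
            accepted.append(link)
    return accepted


page = ('<a href="bedrooms.htm">Bedrooms</a> '
        '<a href="http://other.example.net/x">Elsewhere</a>')
print(new_urls("http://www.example.com/home.htm", page,
               {"www.example.com"}, set()))
```

The temporary list here is collector.links; only entries that pass validation are returned for addition to the main URL list.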

With reference again to FIG. 24, extracted data is analyzed by Analysis Engine 2408 using parameters in UPO 2410 and UDIO 2400 before being passed to the results processor shown in FIG. 28. Referring now to FIG. 28, the type of formatted output is selected by switch 2804 using parameters from UPO 2802 using data from 2800. Example formatted types are shown as 2806, 2808, 2810, 2812, 2814, 2818, 2820, 2822, 2824. Embodiments of the invention utilize other formatted outputs in addition to or instead of those shown. The functionality of each of the formatted types may vary from embodiment to embodiment and may also vary according to the nature of the data 2800 being formatted. Examples shown provide for the results to be stored 2808, converted into the HTML 2810 markup language that is displayable by a web browser, emailed 2812 to a plurality of destinations, faxed 2818 to a plurality of destinations, stored in a data base management system 2820, stored as data in a file 2822, and encoded 2824.

In accordance with some embodiments, a plurality of contextually similar data 2816, 2800 are subject to a comparison 2806. The nature of the comparisons may vary according to the nature of the data being compared and the required formatted output. Comparator 2806, results formatters 2806, 2808, 2810, 2812, 2814, 2818, 2820, 2822, 2824 and compression 2826 use parameters from UPO 2802. The formatted results are presented for output and storage 2828. Using the initial rose example, some embodiments may use the comparator 2806 to compare extracted data from a plurality of rose suppliers and sources and generate a report comparing the prices and varieties found.

Although the above examples have emphasized the applicability of the data extraction method to use with the world wide web, the techniques described may be usable with any data source, such as a flat file containing data, with or without data chaining information such as URL's and/or other hyperlinks.

The present invention may be provided as one or more computer-readable programs embodied on or in one or more articles of physical manufacture and one or more articles capable of light, electromagnetic, electronic, mechanical or chemical or other known distribution. The article of manufacture may be an object or transferable unit capable of being distributed, installed or transferred across a computer network such as the Internet, a floppy disk, a hard disk, a CD ROM, a flash memory card, a PROM, a RAM, a ROM, a magnetic tape or other computer readable media. In general, the computer-readable programs may be implemented in any programming language, although the Java language is used in some embodiments, and it is useful if the programming language has the ability to use Regular Expressions (REGEX). The software programs may be stored on or in one or more articles of manufacture as object code.

Some examples of how the described embodiments operate are now discussed. The first example illustrates a “page hierarchy extraction”. Namely, the increasing use of Active Server Pages and other mechanisms that dynamically generate world wide web page content interferes with the generation of a hierarchy or tree of pages. Such a hierarchical tree representation is extremely useful when performing server administrative tasks. In addition, such a representation allows the world wide web designers to understand the data and structural layout. Using the described method, such a representation may be produced by configuring the internal components in a manner described herein.

The UPO 520 (FIG. 5) is configured to include the URL of the start (home page) of the world wide web site to be extracted. Since the page components are not being analyzed, it is not necessary to configure any parameters for the NLP (FIG. 6).

Parameters for the UDIO 2210 (FIG. 22) are configured in a UPO 2200. The data location 2204, data item locators 2206 and extraction method 2208 are not required since it is desired to extract the entire page. The URL locator 2322 is set to accept any URL contained within the WWW site and to reject any URL outside the boundaries of the WWW site, and the data locators 2308 are set to identify any URL which would be extracted 2318 by using the identifiers ‘<a href=’ and ‘</a>’ as start and end delimiters. The extraction engine 2402 extracts the data from URL's and the data is normalized 2404 by using parameters 2406 to discard the page contents and keep the page title, forming the title and URL into tabular data in a form expected by results processor 2412. Encountered URL's of interest are added to the list of URL's to extract 2410.

The analysis engine 2408 is configured by UPO 2410 to take no action and the page title, URL, URL parent and URL siblings (from 2506 and 2510) are passed to the results processor 2412 as tabular data. UPO 2802 is configured with parameters to take analyzed data 2800 (from 2408) for storage 2808 in a tree representation and to email 2814 the extracted URL and title to a plurality of locations. The process by which an entire world wide web site is traversed can be extended to other applications such as the comparison of two sites.
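To illustrate how tabular rows of page title, URL and URL parent can be stored as a tree representation, a minimal sketch follows. The row layout (url, title, parent_url), the function names build_tree and render, and the sample pages are assumptions made for illustration.

```python
def build_tree(rows):
    """Group page URLs under their parent URL.

    rows: iterable of (url, title, parent_url), where parent_url is None
    for the start (home) page. Returns a dict {parent_url: [child_url, ...]}.
    """
    children = {}
    for url, _title, parent in rows:
        children.setdefault(parent, []).append(url)
    return children


def render(children, titles, node=None, depth=0):
    """Return indented lines showing the hierarchy below `node`."""
    lines = []
    for url in children.get(node, []):
        lines.append("  " * depth + f"{titles[url]} ({url})")
        lines.extend(render(children, titles, url, depth + 1))
    return lines


rows = [  # hypothetical extraction output
    ("home.htm", "Home", None),
    ("bedrooms.htm", "Bedrooms", "home.htm"),
    ("kitchens.htm", "Kitchens", "home.htm"),
]
titles = {url: title for url, title, _parent in rows}
print("\n".join(render(build_tree(rows), titles)))
```

The same parent-to-children mapping is the kind of structure that later server-side analysis can consult to check whether a parent page was visited before its children.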

The second example is site extraction. This is similar to the first example in that all URL's of interest are traversed and those URL's that are of no interest are ignored. The UPO 520 (FIG. 5) is configured to include the URL of the start (home page) of the two world wide web sites to be extracted. Since the page components are not being analyzed, there is no need to configure any parameters for the NLP (FIG. 6). Parameters for the UDIO 2210 (FIG. 22) are configured in a UPO 2200. The data location 2204, data identification keys 2314 and data extraction technique 2318 are configured (FIGS. 3 and 4) using UPO 2216 and UPO 2220 to extract data of interest from each extracted URL of interest. The URL locator 2322 is set to accept any URL contained within the world wide web site and to reject any URL outside the boundaries of the world wide web site. The extraction engine 2402 extracts the data from URL's and the data is normalized 2404, by using parameters 2406 and 2410, into tabular data including the URL of the page, in a form expected by results processor 2412 and comparator 2806.

Encountered URL's of interest are added to the list of URL's to extract. The process by which an entire world wide web site is traversed, following only links of interest, can be extended to embodiments that include the addition of more sites. Furthermore, the client data extractor (CDE, FIG. 24) can be extended to extract data from other sources that can be used by the results processor (FIG. 28).

Attention is now turned to the aspect of the invention relating to analysis of access to an information server. The origin of an access to an information server (referred to as a hit) is difficult to determine without resorting to visual or human contact. Hits can originate from a human using a client executing a WWW Browser and also from mechanized devices and computer programs (robots). As used herein, the term “information server” includes any device or machine capable of accepting data requests and supplying information to satisfy the request.

FIG. 29 illustrates a network of information servers and information accessors, such as may exist on the internet. With reference to FIG. 29, client nodes 2900 and extraction robots 2908 are coupled to a plurality of information servers 2902 by a common network 2901. The servers 2902 employ a mechanism to identify and differentiate the hit origins so that they may take appropriate action when a client node 2900 makes an access and different actions when the extraction robot 2908 makes the access. Client nodes 2900 may be in the form of humans and/or other non-automated devices using a browser, software applications that access data on the server and other mechanized agents to which the server wishes to provide data. An access by such a user is referred to as a friendly access or friendly hit. Extraction robots 2908 may be in the form of spiders, robots, crawlers and other software or mechanized agents that require little or no human intervention in operation.

Such extraction robots 2908 are often used for competitive analysis or for “stealing” copyrighted material, and are devices to which the server 2902 may wish to block access. Such accesses are referred to as an unfriendly access or unfriendly hit. Each hit (or collection of hits) has a signature that provides information that can be used to identify friendly and unfriendly accesses to the server. Determining the absolute origin of a hit would require physically tracing the network connection from the server to the origin. Physically tracing the network connection is virtually impossible in practical terms since the duration of the hit may be shorter than a second. It is possible to determine a probability of the origin of the hit from analysis of the hit signature against known properties of the server, known typical properties of clients using browsers 2900 and extraction robots 2908.

FIG. 30 shows timing properties 3002 to load a page into a display device such as a world wide web browser; properties of human access to a world wide web page 3004; and properties of non-human access to a world wide web page 3006. With reference to 3002, although variations exist between commonly used and other embodiments of browsers and other mechanisms used to display a world wide web page, the basic timings are split into the activities required to load the textual part of the page, t_text, and the activities required to process other components in the page such as scripts, t_other, the total minimum time to load the page being t_min. With reference to 3004, the activities that a human operator takes to react to and access a URL are the time to respond to a displayed page and access a URL, t_response, the activities required for the apparatus (e.g., Browser) to process the URL access, t_internal, and other miscellaneous times, t_other, the total minimum time to load the page being t_min.

The corresponding value for t_max can be infinite. Human reactions are statistically longer than those of a mechanical device. Additionally, visual access to URL's can require the entire world wide web page to be loaded, as URL's forming part of an image map (FIG. 32, discussed later) cannot be accurately accessed until the image appears on the screen. The now common practice of displaying advertisements prior to displaying the rest of the page further increases the response time. Image maps typically require the user to position a pointing device over areas of the image onto which URL's have been mapped. Furthermore, the way in which humans typically interact with a browser usually results in pages being re-loaded. For example, with reference to FIG. 31 b, the page home.htm has to be loaded to enable the page “bedrooms.htm” to be selected. The page “kitchens.htm” cannot be selected until the page “home.htm” is re-loaded, as the page bedrooms.htm is being displayed. These times are included in t_other.

Turning again to FIG. 30 with reference to 3006, the activities of a non-human mechanism to react to and access a URL are dependent in part on the speed with which the textual component of the world wide web page is available to the mechanism. This is because the mechanism can see URL's contained in the page, whereas a human operator has to wait until the web page displaying device (a browser) has displayed the page complete with all the images that may contain URL maps. Although there are other timing components, they are relatively small. The timing components are the activities to obtain the textual part of the page, t_text, the time for the apparatus to respond to a URL access, t_internal, and the time to process all other items, t_other, resulting in a minimum time to access a URL from a page, t_min. The corresponding value for t_max can be infinite. Non-human access to a URL is very much faster as only the textual part of the page containing the URL's is required.

FIG. 31 a to 31 d shows an example of a simple web page hierarchy containing simple URL's embedded in the textual part of the page. It is possible for URL's to occur in other parts of the page such as image maps (as shown in FIG. 32). Referring to FIG. 31 a to 31 d, the page layout relationships depicted in the commonly used tree representation are shown. FIG. 31 b shows the page relationship as a simplified visual representation. FIG. 31 c shows an example of the information typically recorded by information servers.

The “Requester ID” 3110 identifies the device requesting information, in this case an Internet IP address. The ‘Data Item ID’ 3112 is an identifier uniquely locating the data item being accessed in the information server. The ‘Time Stamp’ 3114 is the time of the request using the time local to the information server. The ‘Type Of Access’ 3116 in this instance contains supplemental information.

The access times shown are typical minimums and represent the entire time to load and respond to the required URL. The sequences 3118-3144 are a hit signature for the data items accessed. Each hit 3118-3144 represents an individual access to the server, and from these the terms th_av, th_min and th_max can be defined, which describe respectively the average access time between hits, the minimum access time (i.e., fastest access) and the maximum access time (i.e., slowest). These terms form the time component of the hit signature. The other component of the hit signature is the order in which the pages were accessed.
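The time component of a hit signature can be computed from the recorded time stamps as in the following sketch. It assumes the time stamps for one Requester ID are available as numbers of seconds; the function name time_signature and the sample values are illustrative and are not taken from FIG. 31 c.

```python
def time_signature(timestamps):
    """Return (t_av, t_min, t_max) for the intervals between consecutive hits.

    timestamps: hit times in seconds for a single requester, in any order.
    """
    if len(timestamps) < 2:
        raise ValueError("at least two hits are needed to form an interval")
    ordered = sorted(timestamps)
    intervals = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(intervals) / len(intervals), min(intervals), max(intervals)


hits = [0.0, 4.0, 10.0, 12.0]  # hypothetical time stamps
t_av, t_min, t_max = time_signature(hits)
print(t_av, t_min, t_max)  # prints 4.0 2.0 6.0
```

The same calculation applies to hits from a human (giving th_av, th_min and th_max) and to hits from a robot; only the source of the time stamps differs.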

From FIG. 31 c, it can be seen that each page access required the parent page to be loaded. Furthermore, once 3136 was loaded, the parent 3138 had to be loaded before the next page 3140 could be accessed. This part of the hit signature employs knowledge of the tree relationship as shown in FIG. 31 a. Such relationships are not always known, especially if the information server is dynamically generating page content or is an ASP server. However, it has been shown how such a relationship can be derived by referring to Example 1.

Reference to FIG. 31 d shows how the same pages may be accessed from an extraction robot that is making no attempt to disguise its activity. The sequences 3158-3174 are a hit signature for the data items accessed. Each hit 3158-3174 represents an individual access to the server, and from the hits the terms tm_av, tm_min and tm_max can be determined. Tm_av, tm_min and tm_max describe the average access time between hits, the minimum access time (i.e., fastest access) and the maximum access time (i.e., slowest) respectively. These terms form the time component of the hit signature.

Another component of the hit signature is the order in which the pages were accessed. From FIG. 31 d, it can be seen that the pages 3164, 3166, 3168, 3170, 3172 and 3174 were loaded almost concurrently and without further reference to the parent page. This part of the hit signature employs knowledge of the tree relationship as shown in FIG. 31 a. Such relationships are not always known, especially if the information server is dynamically generating page content or is an ASP server. However, it can be seen how this relationship can be determined, again with reference to the page hierarchy example.

Referring to FIG. 32 a and 32 b, a page hierarchy is shown in FIG. 32 a and the visual representation showing the URL's embedded within an image map, represented by birds in this example, is shown in FIG. 32 b. The hit signature component calculations are similar to those described with reference to FIG. 31 c and FIG. 31 d, with the exception that slightly longer access times than those shown in FIG. 31 c are expected. The effect of image maps is depicted in FIG. 32.

Pages that employ frames typically do not allow the user to use browser bookmarks or other shortcuts to go directly to a page. For example, a browser user typically cannot go directly to the file BirdsOfPrey.htm without having first loaded the page home.htm. FIG. 32 shows a series of pages employing the popular image map method of URL access. That is, with the image map access method, the user positions a pointing or other control device over the portion of the page depicting the information required (e.g., represented by birds) and ‘selects’ the URL in a way consistent with the apparatus used to display the page and image. Although the user could potentially try to remember the location of the URL on the page and position the pointing or control device accordingly before the image is displayed, there is no guarantee that the image will appear in exactly the same location on the page. Thus, human users typically wait until the entire image has been loaded, which could mean waiting for the entire page to load.

Reference to FIG. 33 a to 33 c shows methods employed to analyze hit signatures from FIGS. 31 c and 31 d with reference to the definitions set forth in 3002, 3004 and 3006 (FIG. 30). These methods calculate a probability that the origin of a hit is either human or non-human by comparing actual hits with theoretical values and values determined by empirical means. These methods can calculate the hit origin probability of a hit after the accesses have occurred, but this is generally less useful than determining the hit origin probability substantially as the hits are occurring, thereby allowing appropriate responsive action to be taken in a timely fashion.

The determination includes initially generating the reference hit signatures for both human and non-human access. Human access reference hit signatures are calculated in one embodiment by repeatedly accessing the server with a browser under human control, under varying conditions, and determining the average value for t_min, t_av and t_max. Non-human access reference signatures are calculated in one embodiment by repeatedly accessing the server with a mechanized extraction robot, such as that described in the page hierarchy extraction example, under varying conditions and taking the average value for t_min, t_av and t_max. Determining the signature for a human access is more reliable than for mechanized devices that are attempting to spoof or otherwise emulate human behavior. The human access signature is determined from the values of t_min, t_av and t_max in relation to their corresponding reference values and also the numerical distance between these values and their corresponding reference values for non-human access. These calculations are shown in FIG. 33 b and provide values that are usable in conjunction with other parameters to provide an overall probability of the origin of a hit as shown in FIGS. 34 a and 34 b.

Another weighting value may be added, where the weighting value is based on the parent of the data item accessed. With reference to FIGS. 31 c and 31 d, it can be seen that human access to a data item almost always has to travel back to the parent page 3134 before other pages 3136, 3140, 3144 referenced by the parent may be accessed. Data items accessed by a robot, on the other hand, frequently avoid this unnecessary access, as can be seen in 3164, 3166, 3168, 3170, 3172 and 3174, unless the robot is deliberately attempting to spoof the server or emulate particular behavior (such as human behavior). Determining if the parent page has been traversed before another page referenced by the parent is accessed employs a tree hierarchy that is quickly accessible, as shown in FIG. 35. The meaning of the probability values derived from FIGS. 33 and 34 typically varies according to and between specific contexts of embodiments of the invention.
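The parent-traversal check can be sketched as a count of accesses whose parent page was not the immediately preceding access by the same requester. The sketch assumes the tree hierarchy is available as a child-to-parent mapping; the function name parent_skips and the sample sequences are hypothetical, and how the count is turned into a weighting value is left to the embodiment.

```python
def parent_skips(sequence, parent_of):
    """Count accesses whose parent page was not the preceding access.

    sequence: data item IDs in the order one requester accessed them.
    parent_of: mapping from a page to its parent page (root pages absent).
    """
    skips = 0
    previous = None
    for page in sequence:
        parent = parent_of.get(page)
        if parent is not None and previous != parent:
            skips += 1
        previous = page
    return skips


parent_of = {"bedrooms.htm": "home.htm", "kitchens.htm": "home.htm"}
human = ["home.htm", "bedrooms.htm", "home.htm", "kitchens.htm"]
robot = ["home.htm", "bedrooms.htm", "kitchens.htm"]
print(parent_skips(human, parent_of))  # prints 0
print(parent_skips(robot, parent_of))  # prints 1
```

A higher count suggests direct access to a known hierarchy of files, although a page reachable by another route would need to be treated as an exception.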

Referring now to FIG. 33 a, reference terms that apply to both a human and non-human access are shown. Term 3310 defines the time to react to and access a URL from a page. Term 3312 defines the time for an apparatus to respond to a URL access. Term 3314 defines the total time for all other activities. Term 3316 defines the minimum access time for the URL access. Term 3318 defines the maximum access time, which in practical terms can be considered infinite.

Human access is typically longer than non-human access for term 3310. Terms 3312 and 3314 are almost the same for human and non-human access and tend to be small under normal circumstances. In accordance with some embodiments, terms 3310, 3312, 3314 and 3316 are measured with reference to sample browsers and the servers being used under varying conditions of usage.

Reference to FIG. 33 b shows a forward difference term 3320 that is the difference between two hits, n and n+1 (i.e., 3320 and 3322). Term 3322 describes an average difference for a range of hits n0 to n. Term 3324 describes the minimum for a range of hits n0 to n. Term 3326 describes the maximum for a range of hits n0 to n. The terms in FIG. 33 b are used to define reference terms for human and non-human access. Reference to FIG. 33 c shows the terms used to describe a human hit signature. Term 3330 describes the minimum signature value that is the difference between a hit and the human reference minimum 3316.

Term 3332 describes the maximum signature value that is the difference between a hit and the human reference maximum 3318. Term 3334 describes the average signature value that is the difference between a hit and the human reference average 3316. Term 3322 could be substituted for term 3316, providing a rolling average. Term 3330 could also use term 3324 to include the average minimum time. Term 3332 could also use term 3336 to include the average maximum time. Term 3336 defines the average of terms 3330, 3332 and 3334, providing a dampening factor. The usage of these terms influences the accuracy of the calculations between specific embodiments and under specific load conditions, for which it is difficult to generalize.

FIG. 34 a shows how the terms are combined to form a probability value that the specific embodiments can use to determine if a hit is human or non-human in origin. Some embodiments use this value as input to a graphical user interface or other display device to indicate the nature of the hit. Some embodiments also provide the facility to take appropriate action for the hit. Such action might, for example, be to disallow hits that have a very high probability of being non-human in origin.

FIG. 34 a describes proximity terms that indicate how close a hit is to human and non-human references. Term 3400 defines the difference between the minimum human signature value and the robot minimum signature. Term 3402 defines the difference between the average human signature value and the robot average signature. Term 3404 defines the difference between the maximum human signature value and the robot maximum signature. The terms 3400, 3402 and 3404 may be positive or negative and are used to determine the probability terms 3406, 3408 and 3410. Decreasing positive values or increasing negative values indicate higher probabilities. Term 3406 defines the probability that the minimum signature is of non-human origin by calculating the distance to the robot minimum reference value derived from 3314 (FIG. 33). The closer 3406 is to term 3314, the higher the probability. Term 3408 defines the probability that the average signature is of non-human origin by calculating the distance to the robot average reference value derived from 3312 (FIG. 33). The closer 3408 is to term 3312 (FIG. 33), the higher the probability. Term 3410 defines the probability that the maximum signature is of non-human origin by calculating the distance to the robot maximum reference value derived from 3316 (FIG. 33). The closer 3410 is to term 3316 (FIG. 33), the higher the probability.

FIG. 34 b shows how these probability values may be interpreted into a scale of values indicating the probability of human or robot access. Boundary A shows that the hit is within previously encountered or measured hit times for human access, so the hit has a higher probability of human origin. Boundary B shows that the hit is faster (i.e., smaller) than the minimum human reference and therefore the hit has a higher probability of originating from a robot. As the value of the hit approaches tr_ref_minimum, the robot reference minimum, the hit has an increasingly higher probability of originating from a robot. When the hit becomes less than tr_ref_minimum (i.e., is faster than the minimum robot reference time), the probability increases that the hit has originated from a robot. The overlap condition, where the hit is faster than the average robot hit value tr_ref_av, indicating that the hit originates from a fast human or from a robot, will be resolved by specific embodiments.
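One possible reading of this scale is sketched below. It is an assumption, not the calculation of FIG. 34: the score is 0 when the hit interval is at or above the human reference minimum, 1 when it is at or below the robot reference minimum, and rises linearly in between. The function name robot_score and the reference values of 500 ms and 100 ms (borrowed from the earlier timing examples) are illustrative.

```python
def robot_score(interval_ms, human_ref_min_ms, robot_ref_min_ms):
    """Return a score in [0, 1]; higher means more likely a robot.

    Assumption: linear interpolation between the human reference minimum
    (score 0) and the robot reference minimum (score 1).
    """
    if human_ref_min_ms <= robot_ref_min_ms:
        raise ValueError("human reference minimum must exceed robot minimum")
    if interval_ms >= human_ref_min_ms:
        return 0.0  # within measured human hit times (boundary A)
    if interval_ms <= robot_ref_min_ms:
        return 1.0  # at or faster than the robot reference minimum
    # Faster than any measured human access (boundary B): scale by proximity.
    return (human_ref_min_ms - interval_ms) / (human_ref_min_ms - robot_ref_min_ms)


for interval in (800, 300, 50):
    print(interval, robot_score(interval, human_ref_min_ms=500, robot_ref_min_ms=100))
```

A score of 0 here does not prove human origin, since a robot emulating human timing would also fall in that range; other indicia such as access order would still need to be weighed.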

Other terms affect the probability that the hit is of non-human origin, more specifically the way that the data item has been accessed. Many information servers such as those to be found on the world wide web provide control files which are used by web search engines and other robot and automated agents but are not accessed from a browser under normal circumstances. Such an example is the robots.txt file that is used to control the way that web search engines such as Yahoo traverse the site. If the robots.txt file has been accessed, there is a very high probability that the hit is non-human in origin. Moreover, the requester ID for the hit can be recognized in future and the probability terms weighted accordingly. Additionally, some embodiments include hyperlinks, file references, data references or other symbols within a world wide web page in a form that is invisible, undetectable and/or inactive when viewed by a Browser, but would be accessed and activated by a robot or another mechanism not using a browser. These additions are termed “Hickstead mines”, or just “mines”. Accesses to mines can be determined in the same manner as described for the robots.txt file and such access would indicate an extremely high probability that the hit originated from a Robot or other non-human mechanism.
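Detection of accesses to control files and mines can be sketched as a scan of the recorded hits. The sketch assumes each hit is available as a (requester ID, data item ID) pair; the function name flag_requesters, the mine file name and the sample addresses are hypothetical.

```python
def flag_requesters(hits, control_files, mines):
    """Map each requester ID to the control files or mines it accessed.

    hits: iterable of (requester_id, data_item_id) pairs from the server log.
    """
    flagged = {}
    for requester_id, item in hits:
        if item in control_files or item in mines:
            flagged.setdefault(requester_id, set()).add(item)
    return flagged


log = [  # hypothetical hit records
    ("192.0.2.10", "home.htm"),
    ("192.0.2.99", "robots.txt"),
    ("192.0.2.99", "hidden-link.htm"),
    ("192.0.2.10", "bedrooms.htm"),
]
suspects = flag_requesters(log, control_files={"robots.txt"},
                           mines={"hidden-link.htm"})
print(suspects)
```

Requester IDs returned by such a scan could then have their probability terms weighted in later analysis, as described above.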

If the parent to a new page has not been accessed immediately prior to access of the new page and no other route exists to the new page, there is an increased probability that the access to the new page is from a non-human source, and appropriate action could be taken by the server. Robotic emulation of human behavior can be prohibitively time consuming, resulting in the robot making direct access to a known hierarchy of files. FIG. 35 illustrates how to generate a time ordered list of all files accessed by a particular Requester ID, allowing the path to the new page to be determined. Additionally, a complete list is generated of all files accessed for a time period t_start, along with the Requester ID's of the accessors of the files. This information is used to determine if a robot or plurality of robots using different Requester ID's is downloading a set of pages in random order.
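
The parent-page check might, purely as a sketch, be expressed as below. The function name unparented_accesses and the parent_of mapping (each page mapped to the set of pages that link to it) are assumptions introduced for this example; the time ordered list is assumed to be already available for a single Requester ID.

```python
def unparented_accesses(ordered_paths, parent_of):
    """Return the pages that were accessed without a parent page
    having been accessed immediately beforehand.

    ordered_paths: time ordered list of pages hit by one Requester ID.
    parent_of: dict mapping a page to the set of pages that link to it
               (assumed to be known from the page hierarchy).
    """
    suspicious = []
    previous = None
    for path in ordered_paths:
        parents = parent_of.get(path, set())
        # Pages with no known parent (e.g. the site root) are never flagged.
        if parents and previous not in parents:
            suspicious.append(path)
        previous = path
    return suspicious
```

A requester whose list is mostly made up of such unparented accesses would, under this sketch, receive an increased probability of being non-human.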

Some embodiments employ this technique to determine if the entire site, or portions of the site, were repeatedly accessed by the same Requester ID or set of Requester ID's over a period of time, a practice that is common during competitive analysis of sites as discussed above. Referring to FIG. 35, an index of Requester ID's 3500 includes all the ID's that have “hit” the server. Each element in the index corresponds to a Requester ID and points to an index 3504, 3510, 3514 of all the data items accessed by that Requester ID. Each element in the indexes 3504, 3510 and 3514 points to the full hit information held in a storage area 3502. Another index 3506 includes all the data items that have been accessed, each element corresponding to an accessed item and pointing to an index 3508, 3512, 3516, 3518 or 3520 of the Requester ID's originating the hits. These elements in turn reference the full hit information held in the storage area 3502.

In this way, a list of data items accessed by a Requester ID can be determined, and also a list of Requester ID's accessing a particular data item can be determined. These indexes are used to determine the path taken to access a data item and also to determine the data items accessed over time. For example, a search made in index 3506 determines which Requester ID's have accessed the file robots.txt, and each such Requester ID is then used to index elements in index 3500 to determine which other data items that Requester ID has accessed, thus identifying those Requester ID's with a high probability of being non-human in origin and also what other files they have accessed.
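
The two-way indexing of FIG. 35 might be sketched, under stated assumptions, as a pair of dictionaries that both point into a single list of hit records. The class name HitIndex, its method names and the record fields are hypothetical; the comments relate each structure to the reference numerals used above.

```python
from collections import defaultdict


class HitIndex:
    """Illustrative in-memory analogue of the indexes of FIG. 35."""

    def __init__(self):
        self.storage = []                      # storage area 3502: full hit records
        self.by_requester = defaultdict(list)  # index 3500 -> per-requester item indexes
        self.by_item = defaultdict(list)       # index 3506 -> per-item requester indexes

    def record(self, requester_id, item, timestamp):
        position = len(self.storage)
        self.storage.append(
            {"requester_id": requester_id, "item": item, "time": timestamp}
        )
        # Both indexes hold positions into the shared storage area.
        self.by_requester[requester_id].append(position)
        self.by_item[item].append(position)

    def items_for(self, requester_id):
        """Data items accessed by one Requester ID, in recorded order."""
        return [self.storage[i]["item"] for i in self.by_requester.get(requester_id, [])]

    def requesters_for(self, item):
        """Requester ID's that accessed one data item."""
        return sorted({self.storage[i]["requester_id"] for i in self.by_item.get(item, [])})

    def also_accessed_by_requesters_of(self, item):
        """For each requester of `item`, list the other items it accessed."""
        return {
            rid: [other for other in self.items_for(rid) if other != item]
            for rid in self.requesters_for(item)
        }
```

For instance, calling also_accessed_by_requesters_of("/robots.txt") would follow the robots.txt example given above: it looks up the requesters of that file and then lists what else each of them fetched.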

With knowledge of the page hierarchy (FIG. 31 and FIG. 32) obtained as described in the example above, each page or data item is used in index 3506 to extract a list of Requester ID's, and by referencing storage area 3502 the access time is determined. From this information, a frequency distribution map is generated showing which pages or data items have been hit by which Requester ID's over a time interval. In this way, it is determined whether particular Requester ID's repeatedly accessed the same file or data item set over a period of time, providing a higher probability that those Requester ID's are of non-human origin.
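
One possible, simplified form of such a frequency distribution map is sketched below. Hits are assumed to be available as (requester_id, item, time) tuples; the function names and the min_repeats threshold are illustrative assumptions rather than part of the described embodiments.

```python
from collections import Counter


def access_frequency(hits, t1, t2):
    """Count hits per (requester_id, item) pair within the interval [t1, t2]."""
    counts = Counter()
    for requester_id, item, timestamp in hits:
        if t1 <= timestamp <= t2:
            counts[(requester_id, item)] += 1
    return counts


def repeat_visitors(counts, min_repeats):
    """Requester ID's that hit any single item at least `min_repeats` times."""
    return sorted({rid for (rid, _item), n in counts.items() if n >= min_repeats})
```

The threshold at which repetition is treated as evidence of non-human origin is left to the specific embodiment.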

Some embodiments provide high speed indexing and retrieval mechanisms for indexes 3500, 3504, 3506, 3508, 3510, 3512, 3514, 3516, 3518 and 3520 and the storage area 3502, allowing substantially real-time analysis of the other files the Requester ID had visited within a time period t1 to t2. This is used to determine the route that the Requester ID had taken prior to the data request, to give a probability of the origin of the Requester ID. The process of having to load and re-load parent pages has been shown in FIG. 31 and FIG. 32, and the indexing mechanism shown in FIG. 35, combined with knowledge of the page hierarchy, allows a determination of the route taken.

Some embodiments use the probability terms singly or in combination with the information gained from the indexing system shown in FIG. 35 to produce reports, graphs and other displays indicating the origins of Requester ID's in relation to the data items accessed. Other visual representations may be produced, such as pie charts and time, data item and Requester ID distribution histograms, as required by the specific embodiments. Further indexing may also be included in FIG. 35 to cross-reference time stamps and failed or illegal data accesses.

Some embodiments use the probability terms singly or in combination with the information gained from the indexing system shown in FIG. 35 to automatically take action in the event that the hit has a high probability of being of non-human origin. Such action may include refusing access. The action may also include the obfuscation of the data item returned to the Requester ID or redirection to another area in the server. By determining that certain sections of the server are repeatedly hit by non-human origins, steps may be taken to obfuscate those sections by, for example, replacing all the text within the page with a graphical representation which would still be viewable with a browser but would be almost useless to an information gathering automated agent as described in this invention. Other actions may include pointing to a site which contains useless, confusing, or deliberately incorrect or entangling information (e.g., deeply recursive links), or even monitoring the access to gain useful information about the non-human agent requesting information from the information server.
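
As a minimal sketch only, the selection of an action from a combined probability might look like the following. The numeric thresholds, the ordering of the actions and the names Action and choose_action are assumptions made for the example; the description above does not prescribe any particular values or ordering.

```python
from enum import Enum


class Action(Enum):
    SERVE = "serve"          # respond normally
    OBFUSCATE = "obfuscate"  # e.g. return text rendered as an image
    REDIRECT = "redirect"    # point to another area or a decoy site
    REFUSE = "refuse"        # refuse access


def choose_action(p_non_human, obfuscate_at=0.6, redirect_at=0.8, refuse_at=0.95):
    """Map a probability of non-human origin to a server action.

    All three thresholds are hypothetical tuning parameters.
    """
    if not 0.0 <= p_non_human <= 1.0:
        raise ValueError("probability must lie between 0 and 1")
    if p_non_human >= refuse_at:
        return Action.REFUSE
    if p_non_human >= redirect_at:
        return Action.REDIRECT
    if p_non_human >= obfuscate_at:
        return Action.OBFUSCATE
    return Action.SERVE
```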

It is now described with reference to FIGS. 36, 37 and 38 how “clerver” technology is used to facilitate the use of adaptive stores. A “clerver” is a combination of client and server technologies. Client systems as commonly used in the art cannot easily share their stored information with other systems. Servers as commonly used in the art are used for mass storage of information that is dispensed to a client in response to an information request by a client.

On the other hand, in accordance with embodiments of the invention, clervers are computational devices (see FIG. 36) that combine the ability to gather, process and share data with clients and other clervers. For example, FIG. 36 shows clervers 3600, 3620, 3650 and 3670 connected to a network onto which are connected other clervers, servers, clients and information repositories (generally, information sources 180). Each of the clervers may operate independently of or in collaboration with any other connected clerver. Some embodiments 3681 use adaptive stores 3682, 3684, 3686 to hold (in this example) a list of all data extracted 3682, a list of all accesses to data repositories 3684 and a list of all questions 3686 asked of the clerver by a user, although the number and purpose of such adaptive stores should in no way be considered restricted to that of this example.

FIG. 37 shows four clervers 3700, 3710, 3726, 3748 interconnected on a network onto which are connected other servers, clervers and data repositories 3630. Particular notice is drawn to the way in which the clervers make available onto the network certain adaptive stores 3706, 3714, 3716, 3720, 3722 such that this data can be accessed by other clervers. Adaptive stores 3702, 3704, 3712, 3724, 3742, 3746 are not made available on the network and can only be accessed directly by the clerver in which they are held. For example, clerver A 3700 can access its “own” adaptive stores 3702 and 3704 and additionally networked adaptive stores 3714 and 3716 of clerver B 3710 and networked adaptive stores 3720 and 3722 of clerver C 3726, but cannot access the other adaptive stores 3712, 3724, 3740, 3742, 3764 which are not connected to the network. The number of possible interconnections is theoretically unlimited and should in no way be considered limited to the numbers shown in the FIG. 37 example.
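
The distinction between networked and non-networked adaptive stores might be illustrated by the following sketch. The Clerver class, its methods and the use of PermissionError are assumptions made for illustration; a real embodiment would reach remote stores over a network rather than through in-process object references.

```python
class Clerver:
    """Toy model of a clerver holding private and networked adaptive stores."""

    def __init__(self, name):
        self.name = name
        self._stores = {}  # store name -> (data, networked flag)

    def add_store(self, store_name, data, networked=False):
        self._stores[store_name] = (data, networked)

    def read(self, owner, store_name):
        """Read a store held by `owner` on behalf of this clerver."""
        data, networked = owner._stores[store_name]
        # A clerver may always read its own stores; stores of other
        # clervers are readable only if made available on the network.
        if owner is not self and not networked:
            raise PermissionError(
                f"{store_name!r} of clerver {owner.name} is not networked"
            )
        return data
```

In this sketch, a clerver reads its own stores freely, reads another clerver's networked stores, and is refused access to another clerver's non-networked stores.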

Attention is drawn to another aspect of this interconnection relating to the ability for the adaptive stores to be interconnected in a number of independent networks, as illustrated in FIG. 38. Clerver A 3800 includes adaptive stores 3802 and 3804 connected to network 3852 and adaptive store 3806 connected to network 3856. Clerver C 3810 adaptive store 3816 and clerver C 3860 adaptive stores 3864, 3866 are also connected to network 3856. Network 3850 interconnects adaptive stores 3812, 3814, 3840, 3842 and 3846. The networks 3852, 3856 and 3850 are independent but should not be considered fixed, as any adaptive store can theoretically be connected to any other adaptive store that is interconnected on the same network. More particularly, the adaptive store triggers (discussed earlier) can “decide” to connect to another adaptive store dynamically at run-time: there need be no static physical connection.

As for the comparison operation, in some embodiments, the data is “normalized” before comparison. By normalizing the data, the data to be compared is placed in a contextually compatible format that can then be used to generate reports, to compare with other contextually similar data, and for other purposes that the specific embodiment may require. However, determining the meaning of the data can require access to potentially large amounts of information. For example, if the data is in the form of a natural language such as English, determining context requires knowledge of the meaning of the words, by access to a potentially large knowledge base of data. For example, the term “bucks” could refer to a quantity of money used in the United States of America or a plurality of male deer. Clearly, context is required as well as the meaning of the words, requiring a knowledge of previously encountered or used words, terms, phrases, sentences, etc. Some estimates put the number of words in common usage at over 300,000; other estimates put the number of acronyms in common usage at over 190,000. Specific industries use extensive terminology, examples being Latin names in biology and medical names for drugs.

The context of the data influences the size of the knowledge base contained within the adaptive stores. An aspect of this invention provides a storage system that intelligently adapts to the nature and context of the data being stored and accessed, greatly reducing the amount of storage space required. Such intelligent adaptation to data usage is particularly useful for situations where storage space is limited, such as WWW browsers and wireless devices. Another aspect of the invention links this aforementioned adaptive storage with contextual knowledge that a particular sequence of data will follow. For example, embodiments processing information about lions will find the statement “lions eat zebra” more likely than “lions eat mountains”, since zebra is a food and mountains are very large rocks. The word “eat” implies that the next term or word will be a food product in the context of a mammalian carnivore. Embodiments thus pre-load adaptive stores with words of a carnivorous food context, eliminating the need to examine a larger knowledge base.

Consumers and businesses can all benefit from a mechanism that facilitates the easy identification and comparison of data. For example, those performing electronic commerce can easily identify and catalog information on web repositories of their competitors.

The ability of a data server to process data requests is often dependent on the nature of the specific request, and servicing one request can result in the server's inability to service other requests. Data requests may have different priorities; for example, an eCommerce server may place higher priority on performing the activities associated with a credit card transaction than on servicing a request for an image on a web site, which could take lower priority. In this example, lower priority requests could be delayed until such time as the higher priority activities are complete. The prioritization of data requests and their processing requirements is determined by specific embodiments in accordance with this invention or is supplied in the form of an external data input. Examples of such inputs are those from another data server or from another component of the same server. For example, a data server could specify that all data requests to image files should be delayed by a factor dependent on the total load on the server.
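
The load-dependent delay of image requests might be sketched as below. The linear relationship between load and delay, the max_delay parameter, the list of file suffixes and the function name response_delay are all assumptions chosen for the example; the description above states only that the delay depends on the total load.

```python
def response_delay(path, load, max_delay=2.0,
                   low_priority_suffixes=(".png", ".jpg", ".gif")):
    """Return the number of seconds by which to delay a response.

    path: requested resource path.
    load: total server load, normalized to the range 0.0 to 1.0 (assumed).
    max_delay: hypothetical delay applied to image requests at full load.
    """
    if not 0.0 <= load <= 1.0:
        raise ValueError("load must lie between 0 and 1")
    if path.lower().endswith(low_priority_suffixes):
        # Lower priority request: delay grows linearly with load.
        return max_delay * load
    # Higher priority request (e.g. a transaction): never delayed.
    return 0.0
```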

Having described certain embodiments of the invention, it will now become apparent to those of skill in the art that other embodiments incorporating the concepts of the invention may be used. Therefore, the invention should not be limited to certain embodiments, but rather should be limited only by the spirit and scope of the disclosure.

1. In a computer network having a plurality of interconnected computer resources, the computer network having associated with it a data repository that includes a plurality of data items in electronic format distributed widely among the interconnected computer resources, a method of locating portions of the electronic data in the data repository based on a search query, comprising: processing the search query to determine at least one meaning associated with the search query; and locating the portions of the electronic data based on the determined meaning and in accordance with a context ascribed to the determined meaning with reference to meanings associated with previous result data, located in response to previous search queries.
 2. The method of claim 1, wherein: the previous result data is organized in a particular manner to ascribe the context to the determined meaning; and the locating step includes, based on the particular manner of organization, comparing the determined meaning to the meanings associated with previous result data.
 3. The method of claim 2, wherein: the comparing step includes: comparing the determined meaning to the meanings associated with the previous result data in a particular order that is based on the particular manner of organization.
 4. The method of claim 2, and further comprising: maintaining a store of the meanings associated with the previous result data, organized in the particular manner.
 5. The method of claim 4, wherein the particular manner is order of locating the previous result data.
 6. The method of claim 3, wherein the order of comparing is based at least in part on a relative frequency with which the previous result data has been accessed.
 7. The method of claim 1, wherein: the search query is by a particular user; and the previous search queries include search queries by users other than the particular user.
 8. The method of claim 7, wherein: the previous result data is organized in the plurality of results stores in a particular manner that ascribes the context of the determined meaning; and the locating step includes, based on the particular manner of organization, comparing the determined meaning to the meanings associated with the previous result data.
 9. The method of claim 1, wherein: the method further includes maintaining a pointer store that includes at least one entry pointing to a store of previous result data; and the locating step includes initially locating the store of previous result data based on the pointer store.
 10. The method of claim 2, and further comprising: maintaining the particular manner of organization.
 11. The method of claim 10, wherein: the maintaining step includes, when a particular previous result data is located based on the search query, organizing the previous result data to influence the prominence with which the located particular previous result data affects the ascription of context.
 12. The method of claim 11, wherein: the previous result data are co-accessible by a plurality of users presenting search queries; and in the maintaining step, the organizing step is executed based on the particular previous result data located based on the search queries presented by the plurality of users.
 13. The method of claim 7, wherein: the previous result data are co-accessible by the particular user and the other users.
 14. A method of emulating access to a data repository by a particular type of access mechanism, comprising: analyzing a collection of representative accesses by the access mechanism to determine a collective access signature; and accessing the data repository by performing actions in accordance with the determined access signature.
 15. A method of detecting whether a collection of actions to access a data repository is not by a particular type of access mechanism, comprising: analyzing the collection of actions to determine a collective access signature; and processing the collective access signature to determine a probability that the collection of accesses is not by the particular type of access mechanism.
 16. The method of claim 15, wherein: the processing step includes a step of determining a probability based initially on an indication within the collective access signature of a frequency value that corresponds to the frequency with which the accesses are occurring.
 17. The method of claim 16, wherein: in the processing step, when the frequency value indicated within the collective access signature is above a particular threshold, further processing the collective access signature to determine a probability that the collection of accesses is not by the particular type of access mechanism based on other properties of the collection of accesses, other than frequency, indicated in the signature.
 18. The method of claim 16, wherein: in the processing step, the probability determining step includes determining whether the frequency value is above a particular frequency value threshold.
 19. The method of claim 18, wherein: the method further comprises determining the particular frequency value threshold based on frequency of prior accesses to the data repository.
 20. The method of claim 17, wherein: the other properties includes an order in which the accesses of the collection of accesses occur.
 21. The method of claim 20, wherein the method includes: determining the order in which the accesses of the collection of accesses occurs from an order value indicated in the access signature; and comparing the actual order against the determined order.
 22. The method of claim 17, wherein: the other properties includes at least one of time between accesses and order of accesses.
 23. The method of claim 17, wherein: the other properties includes an access to a data item that would normally only be accessed by an automated mechanism.
 24. The method of claim 23, wherein: the method further comprises introducing into the data repository the components that would normally only be accessed by an automated mechanism.
 25. The method of claim 15, and further comprising: when the collection of actions to access the data repository is determined to be not by a particular type of access mechanism, taking at least one of the actions of: for at least one access after the collection of accesses, modifying the data that would otherwise be provided out of the data repository; for at least one access after the collection of accesses, not responding to the access to the data repository; for at least one access after the collection of accesses, providing data in addition to the data that would otherwise be provided out of the data repository; and for at least one access after the collection of accesses, delaying a response to the access.